Preface


Goals 

The goal of this book is to give an overview of fundamental ideas and results about 
rational decision making under uncertainty, highlighting the implications of these 
results for the philosophy and practice of statistics. The book grew from lecture 
notes from graduate courses taught at the Institute of Statistics and Decision Sciences
at Duke University, at the Johns Hopkins University, and at the University of
Washington. It is designed primarily for graduate students in statistics and biostatistics,
both at the Masters and PhD level. However, the interdisciplinary nature of the
material should make it interesting to students and researchers in economics (choice
theory, econometrics), engineering (signal processing, risk analysis), computer science
(pattern recognition, artificial intelligence), and scientists who are interested in
the general principles of experimental design and analysis. 

Rational decision making has been a chief area of investigation in a number of 
disciplines, in some cases for centuries. Several of the contributions and viewpoints 
are relevant both to the education of a well-rounded statistician and to the development
of sound statistical practices. Because of the wealth of important ideas, and
the pressure from competing needs in current statistical curricula, our first course in 
decision theory aims for breadth rather than depth. We paid special attention to two 
aspects: bridging the gaps among the different fields that have contributed to rational
decision making; and presenting ideas in a unified framework and notation while
respecting and highlighting the different and sometimes conflicting perspectives. 

With this in mind, we felt that a standard textbook format would be too constraining
for us and not sufficiently stimulating for the students. So our approach has
been to write a “tour guide” to some of the ideas and papers that have contributed to 
making decision theory so fascinating and important. We selected a set of exciting 
papers and book chapters, and developed a self-contained lecture around each one. 
Some lectures are close to the source, while others stray far from their original inspiration.
Naturally, many important articles have been left out of the tour. Our goal
was to select a set that would work well together in conveying an overall view of the 
fields and controversies. 

We decided to cover three areas: the axiomatic foundations of decision theory; 
statistical decision theory; and optimal design of experiments. At many universities,
these are the subject of separate courses, often taught in different departments and
schools. Current curricula in statistics and biostatistics are increasingly emphasizing 
interdisciplinary training, reflecting similar trends in research. Our plan reflects this 
need. We also hope to contribute to increased interaction among the disciplines by 
training students to appreciate the differences and similarities among the approaches. 

We designed our tour of decision-theoretic ideas so that students might emerge 
with their own overall philosophy of decision making and statistics. Ideally that philosophy
will be the result of contact with some of the key ideas and controversies
in the different fields. We attempted to put the contributions of each article into 
some historical perspective and to highlight developments that followed. We also 
developed a consistent unified notation for the entire material and emphasized the 
relationships among different disciplines and points of view. Most lectures include 
current-day materials, methods, and results, and try at the same time to preserve the 
viewpoint and flavor of the original contributions. 

With few exceptions, the mathematical level of the book is basic. Advanced 
calculus and intermediate statistical inference are useful prerequisites, but an enterprising
student can profit from most of the book even without this background.
The challenging aspect of the book lies in the swift pace at which each lecture 
introduces new and different concepts and points of view. 

Some lectures have grown beyond the size that can be delivered during a 1½ hour
session. Some others merge materials that were often taught as two separate lectures. 
But for the most part, the lecture-session correspondence should work reasonably 
well. The style is also closer to that of transcribed lecture notes than that of a treatise. 
Each lecture is completed by worked examples and exercises that have been helpful 
to us in teaching this material. Many proofs, easy and hard, are left to the student.

Introduction


We statisticians, with our specific concern for uncertainty, are even more liable 
than other practical men to encounter philosophy, whether we like it or not. 

(Savage 1981a) 


1.1 Controversies 

Statistics is a mature field but within it remain important controversies. These controversies
stem from profoundly different perspectives on the meaning of learning
and scientific inquiry, and often result in widely different ways of interpreting the 
same empirical observations. 

For example, a controversy that is still very much alive involves how to evaluate
the reliability of a prediction or guess. This is, of course, a fundamental issue
for statistics, and has implications across a variety of practical activities. Many are 
captured by a case study on the evaluation of evidence from clinical trials (Ware 
1989). We introduce the controversy with an example. You have to guess a secret 
number. You know it is an integer. You can perform an experiment that would yield 
either the number before it or the number after it, with equal probability. You know 
there is no ambiguity about the experimental result or about the experimental answer. 
You perform this type of experiment twice and get numbers 41 and 43. What is the 
secret number? Easy, it is 42. Now, how good an answer do you think this is? Are 
you tempted to say “It is a perfect answer, the secret number has to be 42”? It turns
out that not all statisticians think this is so easy. There are at least two opposed
perspectives on how to go about figuring out how good our answer is: 

Judge an answer by what it says. Compare the answer to other possible 
answers, in the light of the experimental evidence you have collected, 
which can now be taken as a given. 

versus 

Judge an answer by how it was obtained. Specify the rule that led you 
to give the answer you gave. Compare your answer to the answers that 
your rule would have produced when faced with all possible alternative 
experimental results. How far your rule is from the truth in this collection 
of hypothetical answers will inform you about how good your rule is. 
Indirectly, this will tell you how well you can trust your specific answer. 

Let us go back to the secret number. From the first perspective you would compare
the answer “42” to all possible alternative answers, realize that it is the only
answer that is not ruled out by the observed data, and conclude that there is no ambiguity
about the answer being right. From the second perspective, you ask how the
answer was obtained. Let us consider a reasonable recipe for producing answers: 
average the two experimental results. This approach gets the correct answer half the 
time (when the two experimental results differ) and is 1 unit off the remainder of 
the time (when the two experiments yield the same number). Most measures of error 
that consider this entire collection of potential outcomes will result in a conclusion 
that will attribute some uncertainty to your reported answer. This is in sharp contrast
with the conclusion reached from the first perspective. For example,
the standard error of your answer is $1/\sqrt{2}$. By this principle, you would write a
paper reporting your discovery that “the secret number is 42 (s.e. 0.7)” irrespective 
of whether your data are 41 and 43, or 43 and 43. You can think of other recipes, but 
if they are to give you a single guess, they are all prone to making mistakes when the 
two experimental results are the same, and so the story will have the same flavor. 
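
The two perspectives can be checked by brute force. The following is a minimal Python sketch, not part of the original lectures; the function names are ours and purely illustrative. It lists the secret numbers compatible with the observed data (first perspective) and enumerates the four equally likely outcomes of the averaging recipe (second perspective).

```python
from fractions import Fraction
from itertools import product
import math


def compatible_secrets(x1, x2):
    """First perspective: secrets not ruled out by the two observed results.

    Each result is the secret plus or minus 1, so the secret must be
    one unit away from both observations.
    """
    return sorted({x1 - 1, x1 + 1} & {x2 - 1, x2 + 1})


def averaging_rule_errors(secret=42):
    """Second perspective: errors of the averaging recipe over all outcomes.

    Each experiment yields secret - 1 or secret + 1 with equal probability,
    so the four pairs of results are equally likely.
    """
    errors = []
    for x1, x2 in product((secret - 1, secret + 1), repeat=2):
        guess = Fraction(x1 + x2, 2)
        errors.append(guess - secret)
    return errors


errors = averaging_rule_errors()
prob_correct = Fraction(sum(1 for err in errors if err == 0), len(errors))
# The recipe is unbiased, so its variance equals its mean squared error.
mean_squared_error = sum(err ** 2 for err in errors) / len(errors)
standard_error = math.sqrt(mean_squared_error)

print(compatible_secrets(41, 43))  # [42]
print(compatible_secrets(43, 43))  # [42, 44]
print(prob_correct, round(standard_error, 2))
```

With data 41 and 43 only one secret number survives, while the standard error attached to the recipe is the same whichever pair of results was observed.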

The reasons why this controversy exists are complicated and fascinating. When 
things are not as clear cut as in our example, and multiple answers are compatible
with the experimental evidence, the first perspective requires weighing them in
some way, a step that often involves judgment calls. On the other hand, the second
perspective only requires knowing the probabilities involved in describing how the
experiments relate to the secret number. For this reason, the second approach is perceived
by many to be more objective, and more appropriate for scientific inquiry.
Objectivity, its essence, worthiness, and achievability, have been among the most 
divisive issues in statistics. In an extreme simplification the controversy can be 
captured by two views of probability: 

Probability lives in the world. Probability is a physical property like 
mass or wavelength. We can use it to describe stochastic experimental
mechanisms, generally repeatable ones, like the assignments of experimental
units to different conditions, or the measurement error of a
device. These are the only sorts of probabilistic considerations that 
should enter scientific investigations. 

versus 

Probability lives in the mind. Probability, like most conceptual constructs
in science, lives in the context of the system of values and theories
of an individual scientist. There is no reason why its use should be 
restricted to repeatable physical events. Probability can for example be 
applied to scientific hypotheses, or the prediction of one-time events. 

Ramsey (1926) prefaced his fundamental paper on subjective probability with a 
quote from poet William Blake: “Truth can never be told so as to be understood, and 
not be believed.” 

These attitudes define a coordinate in the space of statisticians’ personal philosophies
and opinions, just like the poles of the previous controversy did. These two
coordinates are not the same. For example, there are approaches to the secret number
problem that give different answers depending on whether data are 41 and 43,
or 43 and 43, but do not make use of “subjective” probability to weigh alternative
answers. Conversely, it is common practice to evaluate answers obtained from subjective
approaches by considering how the same approaches would have fared in
other experiments. 

A key aspect that both these dimensions have in common is the use of a stochastic
model as the basis for learning from data. In the secret number story, for example,
the starting point was that the experimental results would fall to the left or right of 
the secret number with equal probability. The origin of the role of probability in 
interpreting experimental results is sampling. The archetype of many statistical theories
is that experimental units are sampled from a larger population, and the goal of
statistical inference is to draw conclusions about the whole population. A statistical 
model describes the stochastic mechanism based on which samples are selected from 
the population. Sometimes this is literally the case, but more often samples and populations
are only metaphors to guide the construction of statistical procedures. While
this has been the model of operation postulated in most statistical theory, in practical 
applications it is only one pole of yet another important controversy: 

Learning requires models. To rigorously interpret data we need to
understand and specify the stochastic mechanism that generated them.
The archetype of statistical inference is the sample-population situation.

versus 

Learning requires algorithms. To efficiently learn from data, it is critical
to have practical tools for exploring, summarizing, visualizing,
clustering, classifying. These tools can be built with or without explicit
consideration of a stochastic data-generating model. 

The model-based approach has ancient roots. One of the relatively recent landmarks
is Fisher’s definition of the likelihood function (Fisher 1925). The algorithmic
approach also goes back a long way in history: for example, most measures of 
dependence, such as the correlation coefficient, were born as descriptive, not inferential,
tools (Galton 1888). The increasing size and complexity of data, and the
interface with computing, have stimulated much exploratory data analysis (Tukey 
1977, Chambers et al. 1983) and statistical work at the interface with artificial intelligence
(Nakhaeizadeh and Taylor 1997, Hastie et al. 2003). This controversy is well
summarized in an article by Breiman (2001). 

The premise of this book is that it is useful to think about these controversies, as 
well as others that are more technical in statistics, from first principles. The principles 
we will bring to bear are principles of rationality in action. Of course, this idea is in 
itself controversial. In this regard, the views of many statisticians distribute along
another important dimension of controversy: 

Statisticians produce knowledge. The scope of statistics is to rigorously
interpret experimental results, and present experimental evidence
in an unbiased way to scientists, policy makers, the public, or whoever 
may be in charge of drawing conclusions or making decisions. 

versus 

Statisticians produce solutions to problems. Understanding data 
requires placing them in the context of scientific theories, which allow 
us to sort important from ancillary information. One cannot answer the 
question “what is important?” without first considering the question 
“important for what?” 

Naturally, producing knowledge helps solve problems, so these two positions
are not in contrast from this standpoint. The controversy is on the extent to which 
the goals of an experiment should affect the learning approaches, and more broadly 
whether they should be part of our definition of learning. 

The best known incarnation of this controversy is the debate between Fisher and 
Neyman about the meaning of hypothesis tests (Fienberg 1992). The Neyman-Fisher 
controversy is broader, but one of the key divides is that the Neyman and Pearson 
theory of hypothesis testing considers both the hypothesis of interest and at least one 
alternative, and involves an explicit quantification of the consequences of rejecting or 
accepting the hypothesis based on the data: the type I and type II errors. Ultimately, 
Neyman and Pearson’s theory of hypothesis testing will be one of the key elements 
in the development of formal approaches to rationality-based statistical analysis. On
the other hand, Fisher’s theory of significance testing does not require considering an
alternative and incarnates a view of science in which hypotheses represent working
approximations to natural laws that serve to guide experimentation until they are
refuted with sufficient strength that new theories evolve. 

A simple example challenges theories of inference that are based solely on 
the evidence provided by observation, regardless of scope and context of the 
theory. Let $x_1$ and $x_2$ be two Bernoulli trials. Suppose the experimenter’s probabilities
are such that $P(x_1 = 0) = P(x_2 = 0) = 0.5$ and $P(x_1 + x_2 = 0) = 0.05$. Then,
$P(x_1 + x_2 = 1) = 0.9$ and $P(x_1 + x_2 = 2) = 0.05$. Let $e$ be the new evidence that $x_1 = 1$,
let $h_1$ be the hypothesis that $x_1 + x_2 = 2$, and $h_2$ be the hypothesis that $x_2 = 1$. Given
$e$, the two hypotheses are equivalent. Yet, probability-wise, $h_1$ is corroborated by the
data, whereas $h_2$ is not. So if one is to consider the change in probability as a measure
of support for a theory, one would be left with either an inconsistent measure of evidence,
or the need to defend the position that the two hypotheses are in some sense
different even when faced with evidence that proves that they are the same. This and
other similar examples seriously question the idea that inductive practice can be adequately
represented by probabilities alone, without relation to their rational use in
action. 
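
The numbers in this example are easy to verify. As a sketch (the dictionary `joint` and the helper `prob` are our own illustrative names, not notation from the text), the joint distribution implied by the stated probabilities can be tabulated and the probabilities of the two hypotheses compared before and after observing the evidence.

```python
from fractions import Fraction

# Joint distribution of the two Bernoulli trials implied by
# P(x1 = 0) = P(x2 = 0) = 0.5 and P(x1 + x2 = 0) = 0.05.
joint = {
    (0, 0): Fraction(5, 100),
    (0, 1): Fraction(45, 100),
    (1, 0): Fraction(45, 100),
    (1, 1): Fraction(5, 100),
}


def prob(event, given=lambda x1, x2: True):
    """Probability of `event`, conditional on `given` (default: no conditioning)."""
    numerator = sum(p for (x1, x2), p in joint.items()
                    if event(x1, x2) and given(x1, x2))
    denominator = sum(p for (x1, x2), p in joint.items() if given(x1, x2))
    return numerator / denominator


def evidence(x1, x2):
    return x1 == 1        # e: x1 = 1


def h1(x1, x2):
    return x1 + x2 == 2   # h1: x1 + x2 = 2


def h2(x1, x2):
    return x2 == 1        # h2: x2 = 1


print(prob(h1), prob(h1, evidence))  # 1/20 1/10
print(prob(h2), prob(h2, evidence))  # 1/2 1/10
```

Given the evidence, both hypotheses have probability 0.1, yet the probability of the first has gone up from 0.05 while that of the second has come down from 0.5.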

There can be disagreements of principle about whether consideration of consequences
and beliefs belongs to scientific inquiry. In reality, though, it is our
observation that the vast majority of statistical inference approaches have an implicit 
or explicit set of goals and values that guide the various steps of the construction. 
When making a decision as simple as summarizing a set of numbers by their median 
(as opposed to, say, their mean) one is making judgments about the relative importance
of the possible oversimplifications involved. These could be made formal, and
in fact there are decision problems for which each of the two summaries is optimal. 
Our view is that scientific discussion is more productive when goals are laid out 
in the open, and perhaps formalized, than when they are hidden or unappreciated. 
As the old saying goes, “there are two types of statisticians: those who know what 
decision problem they are solving and those who don’t.” 
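
To make the mean versus median remark concrete, here is a small illustrative sketch. It assumes squared-error and absolute-error losses as the two decision problems; the text does not name them, and the data and function names are hypothetical. A grid search over candidate summaries shows that the mean is the best single-number summary under squared error and the median under absolute error.

```python
import statistics

# Hypothetical data with one large value, so mean and median differ.
data = [1, 2, 3, 4, 100]


def squared_loss(x, summary):
    return (x - summary) ** 2


def absolute_loss(x, summary):
    return abs(x - summary)


def average_loss(summary, loss):
    """Average loss incurred by replacing every data point with `summary`."""
    return sum(loss(x, summary) for x in data) / len(data)


# Candidate summaries on a grid from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(1001)]

best_under_squared = min(candidates, key=lambda a: average_loss(a, squared_loss))
best_under_absolute = min(candidates, key=lambda a: average_loss(a, absolute_loss))

print(best_under_squared, statistics.mean(data))
print(best_under_absolute, statistics.median(data))
```

Choosing between the two summaries therefore amounts to choosing how heavily large discrepancies should be penalized.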

Despite the draconian simplifications we have made in defining the dimensions 
along which these four controversies unfold, one would be hard pressed to find two 
statisticians who live on the same point in this four-dimensional space. One of the
goals of this book is to help students find their own spot in a way that reflects their 
personal intellectual values, and serves them best in approaching the theoretical and 
applied problems that are important to them. 

We definitely lean on the side of “judging answers by what they say” and believing
that “probabilities live in the mind.” We may be some distance away along the
models versus algorithms dimension, at least judging by our approaches in applications.
But we are both enthusiastic about the value of thinking about rationality
as a guide, though sometimes admittedly a rough guide, to science, policy, and 
individual action. This guidance comes at two levels: it tells us how to formally 
connect the tools of an analysis with the goals of that analysis; and it tells us how to 
use rationality-based criteria to evaluate alternative statistical tools, approaches, and 
philosophies. 

Overall, our book is an invitation to Bayesian decision-theoretic ideas. While 
we do not think they necessarily provide a solution to every statistical problem, we
find much to think about in this comment from Herman Chernoff (from a personal
communication to Martin McIntosh): 

Frankly, I am not a Bayesian. I go under the following principle. If you 
don’t understand a problem from a Bayesian decision theory point of 
view, you don’t understand the problem and trying to solve it is like 
shooting at a target in the dark. Once you understand the problem, it is 
not necessary to attack it from a Bayesian point of view. Bayesian methods
always had the difficulty that our approximations to our subjective
beliefs could carry a lot more information than we thought or felt willing 
to assume. 


1.2 A guided tour of decision theory 

We think of this book as a “tour guide” to the key ideas in decision theory. The 
book grew out of graduate courses where we selected a set of exciting papers and 
book chapters, and developed a self-contained lecture around each one. We make no 
attempt at being comprehensive: our goal is to give you a tour that conveys an overall 
view of the field and its controversies, and whets your appetite for more. Like the
small pictures of great paintings that are distributed on leaflets at the entrance of 
a museum, our chapters may do little justice to the masterpiece but will hopefully 
entice you to enter, and could guide you to the good places. 

As you read it, keep in mind this thought from R. A. Fisher: 

The History of Science has suffered greatly from the use by teachers 
of second-hand material, and the consequent obliteration of the circumstances
and the intellectual atmosphere in which the great discoveries of
the past were made. A first-hand study is always useful, and often... full 
of surprises. (Fisher 1965) 

Our tour includes three parts: foundations (axioms of rationality); optimal data 
analysis (statistical decision theory); and optimal experimental design. 

Coherence. We start with de Finetti’s “Dutch Book Theorem” (de Finetti 1937) 
which provides a justification for the axioms of probability that is based on a simple 
and appealing rationality requirement called coherence. This work is the mathematical
foundation of the “probabilities live in the mind” perspective. One of the
implications is that new information is merged with the old via Bayes’ formula, 
which gets promoted to the role of a universal inference rule—or Bayesian inference. 

Utility. We introduce the axiomatic theory of utility, a theory on how to choose 
among actions whose consequences are uncertain. A rational decision maker proceeds
by assigning numerical utilities to consequences, and scoring actions by their
expected utility. We first visit the birthplace of quantitative utility: Daniel Bernoulli’s
St. Petersburg paradox (Bernoulli 1738). We then present in detail von Neumann and
Morgenstern’s utility theory (von Neumann and Morgenstern 1944) and look at a
criticism by Allais (1953). 

Utility in action. We make a quick detour to take a look at practical matters of 
implementation of rational decision making in applied situations, and talk about how 
to measure the utility of money (Pratt 1964) and the utility of being in good health. 
For applications in health, we examine a general article (Torrance et al. 1972) and a 
medical article that has pioneered the utility approach in health, and set the standard 
for many similar analyses (McNeil et al. 1981). 

Ramsey and Savage. We give a brief digest of the beautiful and imposing axiomatic 
system developed by Savage. We begin by tracing its roots to the work of Ramsey 
(1931) and then cover Chapters 2, 3, and 5 from Savage’s Foundations of statistics 
(Savage 1954). Savage’s theory integrates the coherence story with the utility story, 
to create a more general theory of individual decision making. When applied to statistical
practice, this theory is the foundation of the “statisticians find solutions to
problems” perspective. The general solution is to maximize expected utility, and 
expectations are computed by assigning personal probabilities to all unknowns. 
A corollary is that “answers are judged by what they say.” 

State independence. Savage’s theory relies on the ability to separate judgment of 
probability from judgment of utility in evaluating the worthiness of actions. Here we 
study an alternative axiomatic justification of the use of subjective expected utility 
in decision making, due to Anscombe and Aumann (1963). Their theory highlights
very nicely the conditions for this separation to take place. This is the last chapter on 
foundations. 

Decision functions. We visit the birthplace of statistical decision theory: Wald’s 
definition of a general statistical decision function (Wald 1949). Wald proposed a 
unifying framework for much of the existing statistical theory, based on treating statistical
inference as a special case of game theory, in which the decision maker faces
nature in a zero-sum game. This leads to maximizing the smallest utility, rather than a 
subjective expectation of utility. The contrast of these two perspectives will continue 
through the next two chapters. 

Admissibility. Admissibility is the most basic and influential rationality requirement
of Wald’s classical statistical decision theory. A nice surprise for Savage’s
fans is that maximizing expected utility is a safe way, and often, at least approximately,
the only way, to build admissible statistical decision rules. Nice. In this
chapter we also reinterpret one of the milestones of statistical theory, the
Neyman-Pearson lemma (Neyman and Pearson 1933), in the light of the far-reaching theory
this lemma sparked. 

Shrinkage. The second major surprise from the study of admissibility is the fact that 
$\bar{x}$, the motherhood and apple pie of the statistical world, is inadmissible in estimating
the mean of a multidimensional normal vector of observations. Stein (1955)
was the first to realize this. We explore some of the important research directions
that stemmed from Stein’s paper, including shrinkage estimation, empirical Bayes
estimation, and hierarchical modeling. 

Scoring rules. We change our focus to prediction and explore the implications of 
holding forecasters accountable for their predictions. We study the incentive systems 
that must be set in place for the forecasters to reveal their information/beliefs rather 
than using them to game the system. This leads to the study of proper scoring rules 
(Brier 1950). We also define and investigate calibration and refinement of forecasters. 
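
As a rough illustration of what “proper” means, consider a forecaster who believes an event has probability 0.7 and must report a probability. The sketch below assumes the binary quadratic form of the Brier penalty and, for contrast, a linear (absolute-error) penalty that we introduce only as an example of an improper rule; the function and variable names are ours. The Brier penalty is minimized in expectation by reporting the true belief, while the linear penalty rewards exaggeration.

```python
def expected_brier(report, belief):
    """Expected quadratic penalty when the event has probability `belief`."""
    return belief * (1 - report) ** 2 + (1 - belief) * report ** 2


def expected_linear(report, belief):
    """Expected absolute-error penalty: an improper rule, shown for contrast."""
    return belief * (1 - report) + (1 - belief) * report


belief = 0.7
reports = [i / 100 for i in range(101)]  # candidate reports 0.00, 0.01, ..., 1.00

best_brier_report = min(reports, key=lambda q: expected_brier(q, belief))
best_linear_report = min(reports, key=lambda q: expected_linear(q, belief))

print(best_brier_report)   # 0.7: honesty is optimal under the Brier penalty
print(best_linear_report)  # 1.0: the linear penalty encourages overstatement
```

Under the linear penalty the forecaster does best by gaming the system, which is exactly the behavior a proper scoring rule is designed to remove.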

Choosing models. We try to understand whether statistical decision theory can be
applied successfully to the much more elusive tasks of constructing and assessing 
statistical models. The jury is still out. On this puzzling note we close our tour of 
statistical decision theory and move to experimental design. 

Dynamic programming. We describe a general approach for making decisions 
dynamically, so that we can both learn from accruing knowledge and plan ahead to 
account for how present decisions will affect future decisions and future knowledge. 
This approach, called dynamic programming, was developed by Bellman (1957). We 
will try to understand why the problem is so hard (the “curse of dimensionality”). 

Changes in utility as information. In decision theory, the value of the information 
carried by a data set depends on what we intend to do with the data once we have 
collected them. We use decision trees to quantify this value (DeGroot 1984). We 
also explore in more detail a specific way of measuring the information in a data set, 
which tries to capture “generic learning” rather than specific usefulness in a given 
problem (Lindley 1956). 

Sample size. We finally come to terms with the single most common decision statisticians
make in their daily activities: how big should a data set be? We try to understand
how all the machinery we have been setting in place can help us and give some 
examples. Our discussion is based on the first complete formalization of Bayesian 
decision-theoretic approaches to sample size determination (Raiffa and Schlaifer 
1961). 

Stopping. Lastly, we apply dynamic programming to sequential data collection, 
where we have the option to stop an experiment after each observation. We discuss 
the stopping rule principle, which states that within the expected utility paradigm, 
the rule used to arrive at the decision to stop at a certain stage is not informative 
about parameters controlling the data-generating mechanism. We also study whether 
it is possible to design stopping rules that will stop experimentation only when one’s 
favorite conclusion is reached. 

A terrific preparatory reading for this book is Lindley (2000), who lays out the philosophy
of Bayesian statistics in simple, concise, and compelling terms. As you progress
through the book you will find, generally in each chapter’s preamble, alternative texts
that dwell on individual topics in greater depth than we do. Some are also listed next.
A large number of textbooks overlap with ours and we make no attempt at being comprehensive.
An early treatment of statistical decision theory is Raiffa and Schlaifer
(1961), a text that contributed enormously to defining practical Bayesian statistics
and decision making in the earliest days of the field. Their book was exploring new 
territory on almost every page and, even in describing the simplest practical ideas, 
is full of deep insight. Ferguson (1967) is one of the early influential texts on statistical
decision theory, Bayes and frequentist. DeGroot (1970) has a more restricted
coverage of Part Two (no admissibility) but a more extensive discussion of Part Three 
and an in-depth discussion of foundations, which gives a quite independent treatment 
of the material compared to the classical papers discussed in our book. A mainstay 
statistical decision theory book is Berger (1985), which covers topics throughout our
tour. Several statistics books have good chapters on decision-theoretic topics. Excellent
examples are Schervish (1995) and Robert (1994), both very rigorous and rich in
insightful examples. Bernardo and Smith (1994) is also rich in foundational discussions
presented in the context of both statistical inference and decision theory. French
(1988), Smith (1987), and Bather (2000) cover decision-analytic topics very well. 
Kreps (1988) is an accessible and very insightful discussion of foundations, covered 
in good technical detail. A large number of texts in decision analysis, medical decision
making, microeconomics, operations research, statistics, machine learning, and
stochastic processes cover individual topics. 



Part One 
Foundations 



2 Coherence


Decision making under uncertainty is about making choices whose consequences are 
not completely predictable, because events will happen in the future that will affect 
the consequences of actions taken now. For example, when deciding whether to play 
a lottery, the consequences of the decision will depend on the number drawn, which 
is unknown at the time when the decision is made. When deciding treatment for a 
patient, consequences may depend on future events, such as the patient’s response to 
that treatment. Political decisions may depend on whether a war will begin or end 
within the next month. In this chapter we discuss de Finetti’s justification, the first 
of its kind, for using the calculus of probability as a quantification of uncertainty in 
decision making. 

In the lottery example, uncertainty can be captured simply by the chance of a 
win, thought of, at least approximately, as the long-term frequency of wins over 
many identical replications of the same type of draw. This definition of probability
is generally referred to as frequentist. When making a prognosis for a
medical patient, chances based on relative frequencies are still useful: for example,
we would probably be interested in knowing the frequency of response to
therapy within a population of similar patients. However, in the lottery example,
we could work out properties of the relevant frequencies on the basis of plausible
approximations of the physical properties of the draw, such as independence
and equal chance. With patients, that is not so straightforward, and we would 
have to rely on observed populations. Patients in a population are more different 
from one another than repeated games of the lottery, and differ in ways we may 
not understand. To trust the applicability of observed relative frequencies to our 
decision we have to introduce an element of judgment about the comparability of
the patients in the population. Finally, events like “Canada will go to war within
a month” are prohibitive from a relative frequency standpoint. Here the element 
of judgment has to be preponderant, because it is not easy to assemble, or even 
imagine, a collection of similar events of which the event in question is a typical 
representative. 

The theory of subjective probability, developed in the 1930s by Ramsey and 
de Finetti, is an attempt to develop a formalism that can handle quantification of 
uncertainty in a wide spectrum of decision-making and prediction situations. In 
contrast to frequentist theories, which interpret probability as a property of physical 
phenomena, de Finetti suggests that it is more useful and general to define 
probability as a property of decision makers. An intelligent decision maker may 
recognize and use probabilistic properties of physical phenomena, but can also 
go beyond. Somewhat provocatively, de Finetti often said that probability does 
not exist, meaning that it is not somewhere out there to be discovered, irrespective 
of the person, or scientific community, trying to discover it. De Finetti 
posed the question of whether there could be a calculus for these more general 
subjective probabilities. He proposed that the axioms of probability commonly 
motivated by the frequency definition could alternatively be justified by a single 
rationality requirement now known as coherence. Coherence amounts to avoiding 
loss in those situations in which the probabilities are used to set betting 
odds. De Finetti’s proposal for subjective probability was originally published in 
1931 (de Finetti 1931b) and in English, in condensed form, in 1937 (de Finetti 
1937). 

In Section 2.1.1 we introduce the simple betting game that motivates the 
notion of coherence, the fundamental rationality principle underlying the theory. 
In Section 2.1.2 we present the so-called Dutch Book argument, which shows that an 
incoherent probability assessor can be made a sure loser, and establishes a connection 
between coherence and the axioms of probability. In Section 2.1.3, we will also show 
how to derive conditional probability, the multiplication rule, and Bayes’ theorem 
from coherence conditions. In Section 2.2 we present a temporal coherence theory 
(Goldstein 1985) that extends de Finetti’s to situations where personal or subjective 
beliefs can be revised over time. De Finetti’s 1937 paper is important in statistics 
for other reasons as well. For example, it was a key paper in the development of the 
notion of exchangeability. 


Featured article: 

de Finetti, B. (1937). Foresight: Its logical laws, its subjective sources, in 
H. E. Kyburg and H. E. Smokler (eds.), Studies in Subjective Probability, Krieger, 
New York, pp. 55-118. 

Useful general readings are de Finetti (1974) and Lindley (2000). For a comprehensive 
overview of de Finetti’s contributions to statistics see Cifarelli and Regazzini 
(1996). 

2.1 The “Dutch Book” theorem 

2.1.1 Betting odds 

De Finetti’s theory of subjective probability is usually described using the metaphor 
of betting on the outcome of an as yet unknown event, say a sports event. This is 
a stylized situation, but it is representative of many simple decision situations, such 
as setting insurance premiums. The odds offered by a sports bookmaker, or the premiums 
set by an insurance company, reflect their judgment about the probabilities 
of events. So this seems like a natural place to start thinking about quantification of 
uncertainty. 

Savage worries about a point that matters a great deal to philosophers and 
surprisingly less so to statisticians writing on foundations: 

The idea of facts known is implicit in the use of the preference theory. 
For one thing, the person must know what acts are available to him. If, 
for example, I ask what odds you give that the fourth toss of this coin will 
result in heads if the first three do, it is normally implicitly not only that 
you know I will keep my part of the bargain if we bet, but also that you 
will know three heads if you see them. The statistician is forever talking 
about what reaction would be appropriate to this or that set of data, or 
givens. Yet, the data never are quite given, because there is always some 
doubt about what we have actually seen. Of course, in any applications, 
the doubt can be pushed further along. We can replace the event of three 
heads by the less immediate one of three tallies-for-head recorded, and 
then take into our analysis the possibility that not every tally is correct. 
Nonetheless, not only universals, but the most concrete and individual 
propositions are never really quite beyond doubt. Is there, then, some 
avoidable lack of clarity and rigor in our allusion to known facts? It has 
been argued that since indeed there is no absolute certainty, we should 
understand by “certainty” only strong relative certainty. This counsel is 
provocative, but does seem more to point up, than to answer, the present 
question. (Savage 1981a, p. 512) 

Filing away this concern under “philosophical aches and pains,” as Savage would put 
it, let us continue with de Finetti’s plan. 

Because bookmakers (and insurance companies!) make a profit, we will, at least 
for now, dissect the problem so that only the probabilistic component is left. So 
we will look at a situation where bookmakers are willing to buy and sell bets at 
the same odds. To get rid of considerations that come up when one bets very large 
sums of money, we will assume, like de Finetti, that we are in a range of bets that 
involves enough money for the decision maker to take things seriously, but not big 
enough that aversion to potentially large losses may interfere. In the next several 
chapters we will discuss how to replace monetary amounts with a more abstract and 
general measure, built upon ideas of Ramsey and others, that captures the utility to a 
decision maker of owning that money. For now, though, we will keep things simple 
and follow de Finetti’s assumptions. With regard to this issue, de Finetti (1937) notes 
that: 


Such a formulation could better, like Ramsey’s, deal with expected utilities; 
I did not know of Ramsey’s work before 1937, but I was aware of 
the difficulty of money bets. I preferred to get around it by considering 
sufficiently small stakes, rather than to build up a complex theory to deal 
with it. (de Finetti 1937, p. 140) 

So de Finetti separates the derivation of probability from consideration of utility, 
although rationality, more broadly understood, is part of his argument. In later 
writings (de Finetti 1952, de Finetti 1964b), he discussed explicitly the option of 
deriving both utilities and probability from a single set of preferences, and seemed to 
consider it the most appropriate way to proceed in decision problems, but maintained 
that the separation is preferable in general, giving two reasons: 

First, the notion of probability, purified from the factors that affect utility, 
belongs to a logical level that I would call “superior”. Second, constructing 
the calculus of probability in its entirety requires vast developments 
concerning probability alone. (de Finetti 1952, p. 698, our translation) 

These “vast developments” begin with the notion of coherent probability assessments. 
Suppose we are interested in predicting the result of an upcoming tennis 
match, say between Fisher and Neyman. Bookmakers are generally knowledgeable 
about this, so we are going to examine the bets they offer as a possible way of quantifying 
uncertainty about who will win. Bookmakers post odds. If the posted odds 
in favor of Fisher are, say, 1:2, one can bet one dollar, and win two dollars if Fisher 
wins and nothing if he does not. In sports you often see the reciprocal of the odds, 
also called odds “against,” and encounter expressions like “Fisher is given 2:1” to 
convey the odds against. 

To make this more formal, let $\theta$ be the indicator of the event “Fisher wins the 
game.” We say, equivalently, that $\theta$ occurred or $\theta$ is true or that the true value of $\theta$ 
is 1. A bet is a ticket that will be worth a stake $S$ if $\theta$ occurs and nothing if $\theta$ does 
not occur. A bookmaker generally sells bets at a price $\pi_\theta S$. The price is expressed 
in units of the stake; when there is no ambiguity we will simply use $\pi$. The ratio 
$\pi : (1 - \pi)$ is the betting odds in favor of the event $\theta$. In our previous example, 
where odds in favor are 1:2, the stake $S$ is three dollars, the price $\pi_\theta S$ is one dollar, and 
$\pi_\theta$ is 1/3. 
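
The correspondence between odds and prices can be sketched in a few lines of Python. This is only an illustration: the function names `price_from_odds` and `odds_from_price` are ours, not part of the original development, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def price_from_odds(a, b):
    """Unit price pi implied by odds a:b in favor, since a:b = pi:(1 - pi)."""
    return Fraction(a, a + b)

def odds_from_price(price):
    """Odds in favor pi:(1 - pi), reduced to the smallest whole numbers."""
    ratio = Fraction(price) / (1 - Fraction(price))
    return ratio.numerator, ratio.denominator

# Example from the text: odds in favor of 1:2 and a stake of three dollars.
pi = price_from_odds(1, 2)   # unit price, in units of the stake
stake = 3
cost = pi * stake            # price of the ticket in dollars
```

With these inputs the unit price is 1/3 and the ticket costs one dollar, matching the example above.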

The action of betting on $\theta$, or buying the ticket, will be denoted by $a_{S,\theta}$. This 
action can be taken by either a client or the bookmaker, although in real life it is 
more often the client’s. What are the consequences of this action? The buyer will 
have a net gain of $(1 - \pi_\theta)S$, that is the stake $S$ less the price $\pi_\theta S$, if $\theta = 1$, or a net gain 
of $-\pi_\theta S$, if $\theta = 0$. These net gains are summarized in Table 2.1. We are also going 
to allow for negative stakes. The action $a_{-S,\theta}$, whose consequences are also shown 
in Table 2.1, reverses the role of the buyer and seller compared to $a_{S,\theta}$. A stake of 
zero will represent abstaining from the bet. Also, action $a_{-S,\theta}$ (selling the bet) has by 
definition the same payoff as buying a bet on the event $\theta = 0$ at stake $S$ and price 
$\pi_{1-\theta} = 1 - \pi_\theta$. 

Table 2.1 Payoffs for the actions corresponding to buying 
and selling bets at stake $S$ on event $\theta$ at odds $\pi_\theta : (1 - \pi_\theta)$. 

Action                                   States of the world 
                                         $\theta = 1$             $\theta = 0$ 
Buy bet on $\theta$    $a_{S,\theta}$     $(1 - \pi_\theta)S$      $-\pi_\theta S$ 
Sell bet on $\theta$   $a_{-S,\theta}$    $-(1 - \pi_\theta)S$     $\pi_\theta S$ 

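
As a quick check of the payoffs for buying and selling, the sketch below computes the buyer's net gain for a signed stake and confirms that selling a bet on the event pays the same as buying a bet on its complement at the complementary price. The helper `net_gain` and the numbers are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

def net_gain(price, stake, theta):
    """Net gain from a ticket worth `stake` if theta == 1, bought at unit price `price`.

    A negative stake represents selling the ticket instead of buying it.
    """
    return stake * (theta - price)

pi = Fraction(1, 3)   # unit price of the bet on the event
S = 3                 # stake in dollars

# Payoffs listed in the order (theta = 1, theta = 0).
buy = [net_gain(pi, S, t) for t in (1, 0)]
sell = [net_gain(pi, -S, t) for t in (1, 0)]

# Buying a bet on the complementary event: indicator 1 - theta, price 1 - pi.
buy_complement = [net_gain(1 - pi, S, 1 - t) for t in (1, 0)]
```

Here `sell` and `buy_complement` coincide state by state, which is the identity between selling a bet and buying the bet on the complement.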

We will work in the stylized context of a bookmaker who posts odds $\pi_\theta : (1 - \pi_\theta)$ 
and is willing to buy or sell bets at those odds, for any stake. In other words, 
once the odds are posted, the bookmaker is indifferent between buying and selling 
bets on $\theta$, or abstaining. The expression we will use for this is that the odds 
are fair from the bookmaker’s perspective. It is implicitly assumed that the bookmaker 
can assess his or her willingness to buy and sell directly. Assessing odds is 
therefore assumed to be a primitive, as opposed to derivative, way of expressing 
preferences among actions involving bets. We will need a notation for binary comparisons 
among bets: $a_1 \sim a_2$ indicates indifference between two bets. For example, 
a bookmaker who considers odds on $\theta$ to be fair is indifferent between $a_{S,\theta}$ and 
$a_{-S,\theta}$, that is $a_{S,\theta} \sim a_{-S,\theta}$. Also, the symbol $\succ$ indicates a strict preference relation. 
For example, if odds in favor of $\theta$ are considered too high by the bookmaker, then 
$a_{-S,\theta} \succ a_{S,\theta}$ for a positive stake $S$: the bookmaker would rather sell the bet than buy it. 

2.1.2 Coherence and the axioms of probability 

Before proceeding with the formal development, let us illustrate the main idea using a 
simple example. Lindley (1985) has a similar one, although he should not be blamed 
for the choice of tennis players. Let us imagine that you know a bookmaker who is 
willing to take bets on the outcome of the match between Fisher and Neyman. Say 
the prices posted by the bookmaker are 0.2 (1:4 odds in favor) for bets on the event 
$\theta$: “Fisher wins,” and 0.7 (7:3 odds in favor) for bets on the event “Neyman wins.” 
In the setting of the previous section, this means that this bookmaker is willing to 
buy or sell bets on $\theta$ for a stake $S$ at those prices. If you bet on $\theta$, the bookmaker 
cashes $0.2S$ and then gives you back $S$ (a net gain to him of $-0.8S$) if Fisher wins, 
and nothing (a net gain of $0.2S$) if Fisher loses. In tennis there are no ties, so the 
event “Neyman wins” is the same as “Fisher loses” or $\theta = 0$. The bookmaker has 
posted separate odds on “Neyman wins,” and those imply that you can also bet on 
that, in which case the bookmaker cashes $0.7S$ and returns $S$ (a net gain of $-0.3S$) if 
Neyman wins and nothing (a net gain of $0.7S$) if Neyman loses. Let us now see what 
happens if you place both bets. Your gains are: 

                     Fisher wins     Neyman wins 
Bet 1                $0.8S$          $-0.2S$ 
Bet 2                $-0.7S$         $0.3S$ 
Both bets (total)    $0.1S$          $0.1S$ 


So by placing both bets you can make the bookmaker lose money irrespective 
of whether Fisher or Neyman will win! If the stake is 10 dollars, you win one dollar 
either way. There is an internal inconsistency in the prices posted by the bookmaker 
that can be exploited to an economic advantage. In some bygone era, this used to be 
called “making Dutch Book against the bookmaker.” It is a true embarrassment, even 
aside from financial considerations. So the obvious question is “how to avoid Dutch 
Book?” Are there conditions that we can impose on the prices so that a bookmaker 
posting those prices cannot be made a sure loser? The answer is quite simple: prices 
$\pi$ have to satisfy the axioms of probability. This argument is known as the “Dutch 
Book theorem” and is worth exploring in some detail. 
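
The arithmetic of the tennis example can be reproduced with a short sketch. The function `client_gains` is a hypothetical helper of ours: it assumes one bet per event of a partition, so exactly one ticket pays its stake while the client pays for all of them.

```python
from fractions import Fraction

def client_gains(prices, stakes):
    """Client's net gain in each state of a partition.

    prices[i] and stakes[i] are the unit price and the stake of the bet on
    event i; exactly one of the events occurs.
    """
    total_cost = sum(p * s for p, s in zip(prices, stakes))
    return [s - total_cost for s in stakes]

# Prices posted on "Fisher wins" and "Neyman wins"; they sum to 0.9, not 1.
incoherent = [Fraction(2, 10), Fraction(7, 10)]
gains = client_gains(incoherent, [10, 10])      # ten-dollar stake on each bet

# Prices that sum to one: no combination of stakes wins in every state,
# because the price-weighted average of the gains is always zero.
coherent = [Fraction(3, 10), Fraction(7, 10)]
coherent_gains = client_gains(coherent, [10, -4])
weighted = sum(p * g for p, g in zip(coherent, coherent_gains))
```

With the incoherent prices the client wins one dollar whichever player wins. With the coherent prices the weighted average `weighted` is zero for any stakes, so the gains cannot all be strictly positive.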

Avoiding Dutch Book is the rationality requirement that de Finetti had in mind 
when introducing coherence. 

Definition 2.1 (Coherence) A bookmaker’s betting odds are coherent if a client 
cannot place a bet or a combination of bets such that no matter what outcome occurs, 
the bookmaker will lose money. 

The next theorem, due to de Finetti (1937), formalizes the claim that coherence 
requires prices that satisfy the axioms of probability. For a more formal development 
see Shimony (1955). The conditions of the theorem, in the simple two-event 
version, are as follows. Consider disjoint events $\theta_1$ and $\theta_2$, and assume a bookmaker 
posts odds (and associated prices) on all the events in the algebra induced by these 
events, that is on $1 - \Theta$, $\theta_1$, $\theta_2$, $(1 - \theta_1)(1 - \theta_2)$, $\theta_1 + \theta_2$, $1 - \theta_1$, $1 - \theta_2$, $\Theta$, where $\Theta$ is 
the indicator of the sure event. As we discussed, there are two structural assumptions 
being made: 

DBT1: The odds are fair to the bookmaker, that is the bookmaker is willing to 
both sell and buy bets on any of the events posted. 

DBT2: There is no restriction about the number of bets that clients can buy or 
sell, as long as this is finite. 

The first condition is required to guarantee that the odds reflect the bookmaker’s 
knowledge about the relevant uncertainties, rather than desire to make a profit. The 
second condition is used, in de Finetti’s words, to “purify” the notion of probability 
from the factors that affect utility. It is strong: it implies, for example, that the bookmaker 
values the next dollar just as much as if it were his or her last dollar. Even 
with this caveat, this is a very interesting set of conditions for an initial study of the 
rational underpinnings of probability. 







Theorem 2.1 (Dutch Book theorem) If DBT1 and DBT2 hold, a necessary 
condition for a set of prices to be coherent is to satisfy Kolmogorov’s axioms, that is: 

Axiom 1 $0 \le \pi_\theta \le 1$, for every $\theta$. 

Axiom 2 $\pi_\Theta = 1$. 

Axiom 3 If $\theta_1$ and $\theta_2$ are such that $\theta_1\theta_2 = 0$, then $\pi_{\theta_1} + \pi_{\theta_2} = \pi_{\theta_1 + \theta_2}$. 


Proof: We assume that the odds are fair to the bookmaker, and consider the gain $g_\theta$ 
made by a client buying bets from the bookmaker on event $\theta$. If the gain is strictly 
positive for every $\theta$ in a partition, then the bookmaker can be made a sure loser and 
is incoherent. 

Axiom 1: Suppose, by contradiction, that $\pi_\theta > 1$. When $S < 0$, the gain $g_\theta$ to a 
client is 

$$g_\theta = \begin{cases} (1 - \pi_\theta)S & \text{if } \theta = 1 \\ -\pi_\theta S & \text{if } \theta = 0 \end{cases}$$

which is strictly positive for both values of $\theta$. Similarly, $\pi_\theta < 0$ and $S > 0$ also imply 
a sure loss. 

Axiom 2: Let us assume that Axiom 1 holds. Say, for a contradiction, that 
$0 \le \pi_\Theta < 1$. For any $S > 0$ the gain $g_\Theta$ is $(1 - \pi_\Theta)S > 0$, if $\Theta = 1$. Because $\Theta = 1$ by 
definition, this implies a sure loss. 

Axiom 3: Let us consider separate bets on $\theta_1$, $\theta_2$, and $\theta_3 = \theta_1 + \theta_2 - \theta_1\theta_2 = 
\theta_1 + \theta_2$. $\theta_3$ is the indicator of the union of the two events represented by $\theta_1$ and $\theta_2$. 
Say stakes are $S_{\theta_1}$, $S_{\theta_2}$, and $S_{\theta_3}$, respectively. Consider the partition given by $\theta_1$, $\theta_2$, 
and $(1 - \theta_1)(1 - \theta_2)$. The net gains to the client in each of those cases are 

$$\begin{aligned}
g_{\theta_1} &= S_{\theta_1} + S_{\theta_3} - (\pi_{\theta_1}S_{\theta_1} + \pi_{\theta_2}S_{\theta_2} + \pi_{\theta_3}S_{\theta_3}) \\
g_{\theta_2} &= S_{\theta_2} + S_{\theta_3} - (\pi_{\theta_1}S_{\theta_1} + \pi_{\theta_2}S_{\theta_2} + \pi_{\theta_3}S_{\theta_3}) \\
g_{(1-\theta_1)(1-\theta_2)} &= -(\pi_{\theta_1}S_{\theta_1} + \pi_{\theta_2}S_{\theta_2} + \pi_{\theta_3}S_{\theta_3}).
\end{aligned}$$

These three equations can be rewritten in matrix notation as $Rs = g$, where 

$$g = (g_{\theta_1}, g_{\theta_2}, g_{(1-\theta_1)(1-\theta_2)})', \qquad s = (S_{\theta_1}, S_{\theta_2}, S_{\theta_3})',$$

$$R = \begin{pmatrix}
1 - \pi_{\theta_1} & -\pi_{\theta_2} & 1 - \pi_{\theta_3} \\
-\pi_{\theta_1} & 1 - \pi_{\theta_2} & 1 - \pi_{\theta_3} \\
-\pi_{\theta_1} & -\pi_{\theta_2} & -\pi_{\theta_3}
\end{pmatrix}.$$

If the matrix $R$ is invertible, the system can be solved to get $s = R^{-1}g$. This means 
that a client can set $g$ to be a vector of positive values, corresponding to losses for the 
bookmaker. Thus coherence requires that the matrix $R$ be singular, that is $|R| = 0$, 
which in turn implies, after a little bit of algebra, that $\pi_{\theta_1} + \pi_{\theta_2} - \pi_{\theta_3} = 0$. □ 

We stated and proved this theorem in the simple case of two disjoint events. This 
same argument can be extended to an arbitrary finite set of disjoint events, as done in 
de Finetti (1937). From a mathematical standpoint it is also possible, and standard, to 
define axioms in terms of countable additivity, that is additivity over a denumerable 
partition. This also permits us to talk about coherent probability distributions on both 
discrete and continuous random variables. 
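
The matrix step in the proof of Axiom 3 can be made concrete with a small sketch. It builds $R$ for hypothetical prices that violate additivity, checks that $|R| = \pi_{\theta_1} + \pi_{\theta_2} - \pi_{\theta_3}$, and solves $Rs = g$ for stakes that give the client a gain of one in every state. The helper names and the prices are our assumptions, and Cramer's rule is used only to keep the example self-contained.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def gain_matrix(p1, p2, p3):
    """Matrix R mapping the stakes on theta1, theta2, theta3 to the client's gains."""
    return [[1 - p1, -p2, 1 - p3],
            [-p1, 1 - p2, 1 - p3],
            [-p1, -p2, -p3]]

def solve3(m, g):
    """Solve m s = g by Cramer's rule; requires det3(m) != 0."""
    d = det3(m)
    solution = []
    for j in range(3):
        # Replace column j of m with the right-hand side g.
        mj = [[g[r] if c == j else m[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(mj) / d)
    return solution

# Hypothetical incoherent prices: 1/5 + 3/10 differs from the price 3/5 of the union.
p1, p2, p3 = Fraction(1, 5), Fraction(3, 10), Fraction(3, 5)
R = gain_matrix(p1, p2, p3)
stakes = solve3(R, [1, 1, 1])
gains = [sum(R[r][c] * stakes[c] for c in range(3)) for r in range(3)]
```

For these prices the determinant is nonzero, so the client can find stakes (buying the two separate bets and selling the bet on the union) that win in every state. When the prices are additive the determinant vanishes and no such solution is guaranteed.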

The extension of coherence results from finite to denumerable partitions is controversial, 
as some find it objectionable to state a rationality requirement in terms of 
a circumstance as abstract as taking on an infinite number of bets. In truth, the theory 
as stated allows for any finite number of bets, so this number can easily be made to 
be large enough to be ridiculously unrealistic anyway. But there are other reasons as 
well why finite-bets theory is fun to explore. Seidenfeld (2001) reviews some differences 
between the countably additive theory of probability and the alternative theory 
built solely using finitely additive probability. 

A related issue concerns events of probability zero more generally. Shimony 
(1955), for example, has criticized the coherence condition we discussed in this chapter 
as too weak, and prefers a stricter version that would not consider it rational to 
choose a bet whose return is never positive and sometimes negative. This version 
implies that no possible event can have probability zero, a requirement sometimes 
referred to as “Cromwell’s Rule” (Lindley 1982a). 

2.1.3 Coherent conditional probabilities 

In this section we present a second Dutch Book theorem that applies to coherence 
of conditional probabilities. The first step in de Finetti’s development is to define 
conditional statements. These are more general logical statements based on a three-valued 
logic: statement A conditional on B can be either true, if both are true, or 
false, if A is false and B is true, or void if B is false. In betting terminology this idea 
is operationalized by the so-called “called-off” bets. A bet on $\theta_1$, with stake $S$ at a 
price $\pi$, called off if $\theta_2$ does not occur, means buying at price $\pi S$ a ticket worth the 
following. If $\theta_2$ does not occur the price $\pi S$ will be returned. If $\theta_2$ occurs the ticket 
will be worth $S$ if $\theta_1$ occurs and nothing if $\theta_1$ does not occur, as usual. We denote by 
$\pi_{\theta_1|\theta_2}$ the price of this bet. The payoff is then described by Table 2.2. Under the same 
structural conditions of Theorem 2.1, we have that: 

Theorem 2.2 (Multiplication rule) A necessary condition for coherence of prices 
of called-off bets is that $\pi_{\theta_1|\theta_2}\pi_{\theta_2} = \pi_{\theta_1\theta_2}$. 

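Table 2.2 Payoffs corresponding to buying a bet on $\theta_1 = 1$, called off if 
$\theta_2 = 0$. 

Action                                               States of the world 
                                                     $\theta_1 = \theta_2 = 1$              $\theta_1 = 0, \theta_2 = 1$       $\theta_2 = 0$ 
Bet on $\theta_1$, called off if $\theta_2 = 0$      $(1 - \pi_{\theta_1|\theta_2})S$       $-\pi_{\theta_1|\theta_2}S$        $0$ 

Proof: Consider bets on $\theta_2$, $\theta_1\theta_2$, and $\theta_1|\theta_2$ with stakes $S_{\theta_2}$, $S_{\theta_1\theta_2}$, and $S_{\theta_1|\theta_2}$, 
respectively, and the partition $\theta_1\theta_2$, $(1 - \theta_1)\theta_2$, $(1 - \theta_2)$. The net gains are 

$$\begin{aligned}
g_{\theta_1\theta_2} &= S_{\theta_2} + S_{\theta_1\theta_2} + S_{\theta_1|\theta_2} - (\pi_{\theta_2}S_{\theta_2} + \pi_{\theta_1\theta_2}S_{\theta_1\theta_2} + \pi_{\theta_1|\theta_2}S_{\theta_1|\theta_2}) \\
g_{(1-\theta_1)\theta_2} &= S_{\theta_2} - (\pi_{\theta_2}S_{\theta_2} + \pi_{\theta_1\theta_2}S_{\theta_1\theta_2} + \pi_{\theta_1|\theta_2}S_{\theta_1|\theta_2}) \\
g_{1-\theta_2} &= -(\pi_{\theta_2}S_{\theta_2} + \pi_{\theta_1\theta_2}S_{\theta_1\theta_2} + 0).
\end{aligned}$$

These three equations can be rewritten in matrix notation as $Rs = g$, where 

$$g = (g_{\theta_1\theta_2}, g_{(1-\theta_1)\theta_2}, g_{1-\theta_2})', \qquad s = (S_{\theta_2}, S_{\theta_1\theta_2}, S_{\theta_1|\theta_2})',$$

$$R = \begin{pmatrix}
1 - \pi_{\theta_2} & 1 - \pi_{\theta_1\theta_2} & 1 - \pi_{\theta_1|\theta_2} \\
1 - \pi_{\theta_2} & -\pi_{\theta_1\theta_2} & -\pi_{\theta_1|\theta_2} \\
-\pi_{\theta_2} & -\pi_{\theta_1\theta_2} & 0
\end{pmatrix}.$$

The requirement of coherence implies that $|R| = 0$, which in turn implies that 

$$\pi_{\theta_1|\theta_2}\pi_{\theta_2} - \pi_{\theta_1\theta_2} = 0.$$

□ 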
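
A sketch of the same argument for called-off bets: when the posted prices violate the multiplication rule, the client can pick stakes that win in all three states. The function names and the prices are illustrative assumptions. The stakes come from solving the three gain equations with every gain set to one.

```python
from fractions import Fraction

def called_off_gains(p2, p12, pc, s2, s12, sc):
    """Client's net gains on the partition theta1*theta2, (1-theta1)*theta2, 1-theta2.

    p2, p12, pc are the prices of the bets on theta2, on theta1*theta2, and on
    theta1 called off if theta2 = 0; s2, s12, sc are the corresponding stakes.
    """
    g_both = (1 - p2) * s2 + (1 - p12) * s12 + (1 - pc) * sc
    g_only_theta2 = (1 - p2) * s2 - p12 * s12 - pc * sc
    g_not_theta2 = -p2 * s2 - p12 * s12   # the called-off bet is refunded
    return [g_both, g_only_theta2, g_not_theta2]

def sure_win_stakes(p2, p12, pc):
    """Stakes (s2, s12, sc) giving a gain of 1 in every state, or None if coherent."""
    gap = p12 - pc * p2
    if gap == 0:
        return None
    sc = 1 / gap
    return pc * sc, -sc, sc

# Hypothetical prices: the multiplication rule would require pc = (1/4)/(1/2) = 1/2.
p2, p12, pc = Fraction(1, 2), Fraction(1, 4), Fraction(3, 5)
stakes = sure_win_stakes(p2, p12, pc)
gains = called_off_gains(p2, p12, pc, *stakes)
```

With these numbers the client sells the bets on $\theta_2$ and on $\theta_1|\theta_2$, buys the bet on $\theta_1\theta_2$, and gains one unit whatever happens.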


At this point we have at our disposal all the machinery of probability calculus. 
For example, a corollary of the law of total probability and the conditioning rule is 
Bayes’ rule. Therefore, coherent probability assessment must also obey the Bayes 
rule. 

If we accept countable additivity, we can use continuous random variables and 
their properties. One can define a parallel set of axioms in terms of expectations of 
random variables that falls back on the case we studied if the random variables are 
binary. An important case is that of conditional expectations. If $\theta$ is any continuous 
random variable and $\theta_2$ is an event, then the “conditional random variable” $\theta$ given 
$\theta_2$ can be defined as 

$$\theta|\theta_2 = \theta\theta_2 + (1 - \theta_2)E[\theta|\theta_2]$$

where $\theta$ is observed if $\theta_2$ occurs and not otherwise. Taking expectations, 

$$E[\theta|\theta_2] = E[\theta\theta_2] + (1 - \pi_{\theta_2})E[\theta|\theta_2]$$

and solving gives 

$$E[\theta|\theta_2] = \frac{E[\theta\theta_2]}{\pi_{\theta_2}}. \qquad (2.1)$$

We will use this relationship in Section 2.2. 
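
Equation (2.1) can be checked numerically. The sketch below uses a small discrete joint distribution, which is an assumption made for illustration (the text states the result for continuous random variables); the variable names are ours.

```python
from fractions import Fraction

# Hypothetical joint distribution over pairs (x, t2): the value x of the
# random quantity and the indicator t2 of the conditioning event.
joint = {(1, 1): Fraction(1, 10), (2, 1): Fraction(3, 10),
         (1, 0): Fraction(2, 5), (4, 0): Fraction(1, 5)}

p2 = sum(p for (x, t2), p in joint.items() if t2 == 1)       # price of the event
e_product = sum(x * t2 * p for (x, t2), p in joint.items())   # E[x * t2]
cond = e_product / p2                                         # equation (2.1)

# Expectation of the "conditional random variable" x*t2 + (1 - t2)*cond.
e_called_off = sum((x * t2 + (1 - t2) * cond) * p
                   for (x, t2), p in joint.items())
```

The expectation of the conditional random variable equals `cond`, so the value given by (2.1) is the fixed point of the "taking expectations" step.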


2.1.4 The implications of Dutch Book theorems 

What questions have we answered so far? We definitely answered de Finetti’s original 
one, that is: assuming that we desire to use probability to represent an individual’s 
knowledge about unknowns, is there a justification for embracing the axioms of probability, 
as stated a few years earlier by Kolmogorov? The answer is: yes, there is. If 
the axioms are not satisfied the probabilities are “intrinsically contradictory” and lead 
to losing in a hypothetical game in which they are put to the test by allowing others 
to bet at the implied odds. 

The laws of probability are, in de Finetti’s words, “conditions which characterize 
coherent opinions (that is, opinions admissible in their own right) and which distinguish 
them from others that are intrinsically contradictory.” Within those constraints, 
a probability assessor is entitled to any opinion. De Finetti continues: 


a complete class of incompatible events $\theta_1, \theta_2, \ldots, \theta_n$ being given, all 
the assignments of probability that attribute to $\pi_1, \pi_2, \ldots, \pi_n$ any values 
whatever, which are non-negative and have a sum equal to unity, 
are admissible assignments: each of these evaluations corresponds to a 
coherent opinion, to an opinion legitimate in itself, and every individual 
is free to adopt that one of these opinions which he prefers, or, to put it 
more plainly, that which he feels. The best example is that of a championship 
where the spectator attributes to each team a greater or smaller 
probability of winning according to his own judgment; the theory cannot 
reject a priori any of these judgments unless the sum of the probabilities 
attributed to each team is not equal to unity. This arbitrariness, which 
any one would admit in the above case, exists also, according to the conception 
which we are maintaining, in all other domains, including those 
more or less vaguely defined domains in which the various objective 
conceptions are asserted to be valid. (de Finetti 1937, pp. 139-140) 


The Dutch Book argument provides a calculus for using subjective probability 
in the quantification of uncertainty and gives decision makers great latitude in establishing 
fair odds based on formal or informal processing of knowledge. With this 
freedom come two important constraints. One is that probability assessors be ready, 
at least hypothetically, to “put their money where their mouth is.” Unless ready to 
lie about their knowledge (we will return to this in Chapter 10), the probability 
assessor does not have an incentive to post capricious odds. The other is implicit 
in de Finetti’s definition of event as a statement whose truth will become known to 
the bettors. That the truth of events can be known and agreed upon by many individuals 
in all relevant scientific contexts is somewhat optimistic, and reveals the influence of 
the positivist philosophical school on de Finetti’s thought. From a statistical standpoint, 
though, it is healthy to focus controversies on observable events, rather than 
theoretical entities that may not be ultimately measured, such as model parameters. 
The latter, however, are important in a number of scientific settings. De Finetti’s theory 
of exchangeability, also covered in his 1937 article, is a formidable contribution 
to grounding parametric inference in statements about observables. Covering it here 
would take us too far astray. A good entry point to the extensive literature is Cifarelli 
and Regazzini (1996). 

A second related question is the following: is asking someone for their fair betting 
odds a good way to find out their probability of an event? Or, stated more technically, 
is the betting mechanism a practical elicitation tool for measuring subjective probability? 
The Dutch Book argument does not directly mention this, but nonetheless 
this is an interesting possibility. Discussions are in de Finetti (1974), Kadane and 
Winkler (1988), Seidenfeld et al. (1990a) and Garthwaite et al. (2005) who connect 
statistical considerations to the results of psychological research about “how people 
represent uncertain information cognitively, and how they respond to questions about 
that information.” 

A third question is perhaps the most important: do all rational decision makers 
facing a decision under uncertainty (say the betting problem) have to act as 
though they represented their uncertainty using a coherent subjective probability 
distribution? There is a little bit of extra work that needs to be done before we 
can answer that, and we will postpone it to our discussion of Ramsey’s ideas in 
Chapter 5. 

Lastly, we need to consider the question of temporal coherence. We have seen 
that the Bayes rule and conditional probabilities are derived in terms of called-off 
bets, which are assessed before the conditioning events are observed. As such they 
are static constraints among probabilities of events, all of which are in the future. 
Much of statistical thinking is about what can be said about unknowns after some 
data are observed. Ramsey (1926, p. 180) first pointed out that the two are not the 
same. Hacking (1976) draws a distinction between conditional probability and a posteriori 
probability, the latter being the statement made after the conditioning event is 
observed. The dominant view among Bayesian statisticians has been that the two can 
be equated without resorting to any additional principle. For example, Howson and 
Urbach (1989) argue that unless relevant additional background knowledge accrues 
between the time the conditional probability is stated and the time the conditioning 
event occurs, it is legitimate to equate conditional probability and a posteriori 
probability. And one can often make provisions for this background knowledge by 
incorporating it explicitly in the algebra being considered. 

Others, however, have argued that the leap to using the Bayes rule for a posteriori 
probability is not justified by the Dutch Book theorem. Goldstein writes: 

As no coherence principles are used to justify the equivalence of conditional 
and a posteriori probabilities, this assumption is an arbitrary 
imposition on the subjective theory. As Bayesians rarely make a simple 
updating of actual prior probabilities to the corresponding conditional 
probabilities, this assumption misrepresents Bayesian practice. Thus 
Bayesian statements are often unclear.... The practical implication is 
that Bayesian theory does not appear to be very helpful in considering the 
kind of question that we have raised about the expert and his changing 
judgments. (Goldstein 1985, p. 232) 

There are at least a couple of options for addressing this issue. One is to add to the 
coherence principle the separate principle, taken at face value, that conditional and a 
posteriori probabilities are the same. This is sometimes referred to as the conditionality 
principle. For example, Pratt et al. (1964) hold that conditional probability before 
and after the conditioning event are two different behavioral principles, though in 
their view, equating the two is a perfectly reasonable additional requirement. We 
will revisit this in Chapter 5. Another, which we briefly examine next, is to formalize 
a more general notion of coherence that would apply to the dynamic nature of 
updating. 


2.2 Temporal coherence 

Let us start with a simple example. You are about to go to work on a cloudy morning 
and your current degree of belief about the event $\theta$ that it will rain in the afternoon 
is 0.9. If you ask yourself the same question at lunchtime you may state a different 
belief, perhaps because you hear the weather forecast on the radio, or because you 
see a familiar weather pattern develop. We will denote by $\pi^0_\theta$ the present assessment, 
and by $\pi^T_\theta$ the lunchtime assessment. The quantity $\pi^T_\theta$ is unknown at the present time 
and, therefore, it can be thought of as a random quantity. 

Goldstein describes the dynamic nature of beliefs. Your beliefs, he says, are: 

temporally coherent at a particular moment if your current assessments 
are coherent and you also believe that at each future time point your new 
current assessments will be coherent. (Goldstein 1985, p. 232) 

To make this more formal, one needs to think about degree of belief about future 
probability assessments. In our example, we would need to consider beliefs about 
our future probability $\pi^T_\theta$. We will denote the expected value, computed at time 0, 
of this probability by $E^0[\pi^T_\theta]$. Goldstein (1983) proposes that in order for one to be 
considered coherent over time, his or her expectation for an event’s prevision ought 
to be his or her current probability of that event. 

Definition 2.2 (Temporal coherence) Probability assessments on event $\theta$ at two 
time points 0 and $T$ are temporally coherent iff 

$$E^0[\pi^T_\theta] = \pi^0_\theta. \qquad (2.2)$$

This condition establishes a relation between one’s current assessments and those 
to be asserted at a future time. Also, this relation assures that one’s change in prevision, 
that is $\pi^T_\theta - \pi^0_\theta$, cannot be systematically predicted from current beliefs. In 
Goldstein’s own words: 

I have beliefs. I may not be able to describe “the precise reasons” why I 
hold these beliefs. Even so, the rules implied by coherence provide logical 
guidance for my expression of these beliefs. My beliefs will change. 
I may not now be able to describe precise reasons how or why these 
changes will occur. Even so, just as with my beliefs, the rules implied 
by coherence provide logical guidance for my expression of belief as to 
how these beliefs will change. (Goldstein 1983, p. 819) 
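
To see condition (2.2) at work in the rain example, the sketch below supposes that the lunchtime assessment is obtained by conditioning on a radio forecast. Both the forecast model and its numbers are hypothetical, and updating by Bayes' rule is just one way of revising beliefs that happens to satisfy (2.2); the definition itself does not require it.

```python
from fractions import Fraction

prior = Fraction(9, 10)                   # current degree of belief in rain
p_signal_given_rain = Fraction(4, 5)      # hypothetical: forecast says "rain" when it will rain
p_signal_given_dry = Fraction(3, 10)      # hypothetical: forecast says "rain" when it will not

# Current probability that the forecast will say "rain".
p_signal = prior * p_signal_given_rain + (1 - prior) * p_signal_given_dry

# The two possible lunchtime assessments, one for each forecast.
belief_if_signal = prior * p_signal_given_rain / p_signal
belief_if_no_signal = prior * (1 - p_signal_given_rain) / (1 - p_signal)

# Expectation, computed now, of the lunchtime assessment.
expected_future = (p_signal * belief_if_signal
                   + (1 - p_signal) * belief_if_no_signal)
```

The lunchtime belief will be either 24/25 or 18/25, but its expectation computed now is 9/10, the current assessment, so the direction of the change cannot be predicted in advance.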

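The following theorem, due to Goldstein (1985), extends the previous result to 
conditional previsions. 

Theorem 2.3 If you are temporally coherent, then your previsions must satisfy 

$$E^0[\pi^T_{\theta_1|\theta_2} | \theta_2] = \pi^0_{\theta_1|\theta_2}$$

where $\pi^T_{\theta_1|\theta_2}$ is the revised value for $\pi_{\theta_1|\theta_2}$ declared at time $T$, still before $\theta_2$ is obtained. 

Proof: Consider the definition of conditional expectation given in equation (2.1). If 
we choose $\theta$ to be the random $\pi^T_{\theta_1|\theta_2}$, then 

$$E^0[\pi^T_{\theta_1|\theta_2} | \theta_2] = E^0[\pi^T_{\theta_1|\theta_2}\theta_2] / \pi^0_{\theta_2}.$$

Next, applying the definition of temporal coherence (2.2) with $\theta = \pi^T_{\theta_1|\theta_2}\theta_2$ and 
substituting, we get 

$$E^0[\pi^T_{\theta_1|\theta_2} | \theta_2] = E^0[E^T[\pi^T_{\theta_1|\theta_2}\theta_2]] / \pi^0_{\theta_2}.$$

At time $T$, $\pi^T_{\theta_1|\theta_2}$ is constant, so 

$$\begin{aligned}
E^0[\pi^T_{\theta_1|\theta_2} | \theta_2] &= E^0[\pi^T_{\theta_1|\theta_2}\pi^T_{\theta_2}] / \pi^0_{\theta_2} \\
&= E^0[\pi^T_{\theta_1\theta_2}] / \pi^0_{\theta_2}
\end{aligned}$$

from the definition of conditional probability. Finally, from temporal coherence, 
$E^0[\pi^T_{\theta_1\theta_2}] = \pi^0_{\theta_1\theta_2}$, so 

$$E^0[\pi^T_{\theta_1|\theta_2} | \theta_2] = \pi^0_{\theta_1\theta_2} / \pi^0_{\theta_2} = \pi^0_{\theta_1|\theta_2}.$$

□ 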


Consider events $\theta_i$, $i = 1, \ldots, k$, forming a partition of $\Theta$; that is, one and only
one of them will occur. Formally $\sum_{i=1}^{k} \theta_i = 1$ and $\theta_i \theta_j = 0$ for all $i \neq j$. Also, choose
$\theta$ to be any event in $\Theta$, not necessarily one of the $\theta_i$ above. A useful consequence of
Theorem 2.3 above is that

$$\pi^0_{\theta|\Theta} = \sum_{i=1}^{k} \theta_i\, \pi^0_{\theta|\theta_i}.$$

Similarly, at a future time $T$,

$$\pi^T_{\theta|\Theta} = \sum_{i=1}^{k} \theta_i\, \pi^T_{\theta|\theta_i}.$$

Our next goal is to investigate the relationship between $\pi^0_{\theta|\Theta}$ and $\pi^T_{\theta|\Theta}$.
The next theorem establishes that $\pi^0_{\theta|\Theta}$ cannot systematically predict the change
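$Q = \pi^T_{\theta|\Theta} - \pi^0_{\theta|\Theta}$ in the prevision of conditional beliefs.

Theorem 2.4 If an agent is temporally coherent, then

(i) $\pi^0_Q = 0$,

(ii) $Q$ and $\pi^0_{\theta|\Theta}$ are uncorrelated,

(iii) $\pi^0_{Q|\Theta}$ is identically zero.

Proof: Following Goldstein (1985) we will prove (iii) by showing that $\pi^0_{Q|\theta_i} = 0$ for
every $i$:

$$\begin{aligned}
\pi^0_{Q|\theta_i} &= E^0[\pi^T_{\theta|\Theta} - \pi^0_{\theta|\Theta} \mid \theta_i] \\
&= E^0\Big[\sum_{j} \theta_j\, \pi^T_{\theta|\theta_j} - \sum_{j} \theta_j\, \pi^0_{\theta|\theta_j} \,\Big|\, \theta_i\Big] \\
&= E^0[\pi^T_{\theta|\theta_i} - \pi^0_{\theta|\theta_i} \mid \theta_i] \\
&= 0.
\end{aligned}$$

Summing over $i$ gives the desired result. □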

Although the discussion in this section utilizes previsions of event indicators
(and, therefore, probabilities of events), the arguments hold for more general
bounded random variables. In this general framework, one can directly assess the 
prevision or expectation of the random variable, which is only done indirectly here, 
via the event probabilities. 


2.3 Scoring rules and the axioms of probabilities 

In Chapter 10 we will study measures used for the evaluation of forecasters, after 
the events that were being predicted have actually taken place. These are called scoring
rules and typically involve the computation of a summary value that reflects the
correspondence between the probability forecast and the observation of what actually 
occurred. 

Consider the case where a forecaster must announce probabilities $\pi =
(\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_k})$ for the events $\theta_1, \ldots, \theta_k$, which form a partition of the possible
states of nature. A popular choice of scoring rule is the negative of the sum of the
squared differences between predictions and events:

$$s(\theta_j, \pi) = -\sum_{i=1}^{k} (\pi_{\theta_i} - \theta_i)^2, \qquad (2.3)$$

where each $\theta_i$ is the indicator of the corresponding event, evaluated when $\theta_j$ occurs.


It turns out that the score $s$ above of an incoherent forecaster can be improved upon
regardless of the outcome. This can provide an alternative justification for using
coherent probabilities, without relying on the Dutch Book theorem and the betting
scenario. This argument is in fact used in de Finetti’s treatise on probability
(de Finetti 1974) to justify the axioms of probability. In Section 2.4 we show how 
to derive Kolmogorov’s axioms when the forecaster is scored on the basis of such a 
quadratic scoring rule. 

2.4 Exercises 

Problem 2.1 Consider a probability assessor being evaluated according to the scoring
rule (2.3). If the assessor’s probabilities $\pi$ violate any of the following two
conditions,

1. $0 \leq \pi_{\theta_j} \leq 1$ for all $j = 1, \ldots, k$,

2. $\sum_{j=1}^{k} \pi_{\theta_j} = 1$,

then there is a vector $\pi'$, satisfying conditions 1 and 2, and such that $s(\theta_j, \pi) \leq
s(\theta_j, \pi')$ for all $j = 1, \ldots, k$ and $s(\theta_j, \pi) < s(\theta_j, \pi')$ for at least one $j$.

We begin by building intuition about this result in low-dimensional cases. When
$k = 2$, in the $(\pi_{\theta_1}, \pi_{\theta_2})$ plane the quantities $s(\theta_1, \pi)$ and $s(\theta_2, \pi)$ represent the negative
of the squared distances between the point $\pi = (\pi_{\theta_1}, \pi_{\theta_2})$ and the points $e_1 = (1, 0)$
and $e_2 = (0, 1)$, respectively. In Figure 2.1, points $C$ and $D$ satisfy condition 2
but violate condition 1, while point $B$ violates condition 2 and satisfies condition
1. Can we find values which satisfy both conditions and have smaller distances to
both canonical vectors $e_1$ and $e_2$? In the cases of $C$ and $D$, $e_1$ and $e_2$ do the job.
For $B$, we can find a point $b$ that does the job by looking at the projection of $B$
on the $\pi_{\theta_1} + \pi_{\theta_2} = 1$ line. The scores of $B$ are hypotenuses of the $B e_1 b$ and $B e_2 b$
triangles. The scores of the projection are the sides of the same triangles, that is
the line segments $e_1 b$ and $e_2 b$. The point $E$ violates both conditions. If you are not
yet convinced by the argument, try to find a point that does better than $E$ in both
dimensions.
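
The projection argument can be checked numerically. The sketch below is only illustrative: the coordinates chosen for $B$ are hypothetical (the text does not give the coordinates used in Figure 2.1), and `quadratic_score` and `project_on_sum_one` are names introduced here. It projects a forecast that violates condition 2 onto the line $\pi_{\theta_1} + \pi_{\theta_2} = 1$ and compares scores under each outcome.

```python
def quadratic_score(pi, j):
    """Score (2.3) when event j (0-based) occurs: minus the squared distance to e_j."""
    return -sum((p - (1.0 if i == j else 0.0)) ** 2 for i, p in enumerate(pi))

def project_on_sum_one(pi):
    """Orthogonal projection of pi on the hyperplane whose coordinates sum to 1."""
    shift = (sum(pi) - 1.0) / len(pi)
    return [p - shift for p in pi]

B = [0.7, 0.6]              # hypothetical forecast: each entry in [0, 1], sum is 1.3
b = project_on_sum_one(B)   # lies on the line pi_1 + pi_2 = 1

# Improvement in score obtained by announcing b instead of B, for each outcome.
gains = [quadratic_score(b, j) - quadratic_score(B, j) for j in range(len(B))]
squared_distance_B_to_b = sum((x - y) ** 2 for x, y in zip(B, b))
print(b, gains, squared_distance_B_to_b)
```

In this example the gain is the same under both outcomes and equals the squared distance between $B$ and $b$, which is the Pythagorean relation behind the triangle picture.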

Figure 2.2 illustrates a similar point for $k = 3$. In that case, to satisfy the axioms,
we have to choose points in the shaded area (the simplex on $\Re^3$). This result can be
generalized to $\Re^k$ by using the triangle inequality in $k$ dimensions. We now make this
more formal.

Figure 2.1 Quadratic scoring rule for $k = 2$. Point $B$ satisfies the first axiom of
probability and violates the second one, while points $C$ and $D$ violate the first axiom
and satisfy the second. Point $E$ violates them both, while points $b$, $e_1$, and $e_2$ satisfy
both axioms.



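Figure 2.2 Quadratic scoring rule for $k = 3$.

We begin by considering a violation of condition 1. Suppose that the first component
of $\pi$ is negative, that is $\pi_{\theta_1} < 0$, and $0 \leq \pi_{\theta_j} \leq 1$ for $j = 2, \ldots, k$. Consider
now $\pi'$ constructed by setting $\pi'_{\theta_1} = 0$ and $\pi'_{\theta_j} = \pi_{\theta_j}$ for $j = 2, \ldots, k$. $\pi'$ satisfies
condition 1. If event $\theta_1$ occurs,

$$\begin{aligned}
s(\theta_1, \pi) - s(\theta_1, \pi') &= -\sum_{i=1}^{k} \left[ (\pi_{\theta_i} - \theta_i)^2 - (\pi'_{\theta_i} - \theta_i)^2 \right] \\
&= -\left[ (\pi_{\theta_1} - 1)^2 - (\pi'_{\theta_1} - 1)^2 \right] - \sum_{i=2}^{k} (\pi_{\theta_i}^2 - \pi'^2_{\theta_i}) \\
&= -\left[ (\pi_{\theta_1} - 1)^2 - 1 \right] < 0
\end{aligned}$$

since $\pi_{\theta_1} < 0$. Furthermore, for $j = 2, \ldots, k$,

$$\begin{aligned}
s(\theta_j, \pi) - s(\theta_j, \pi') &= -(\pi_{\theta_1}^2 - \pi'^2_{\theta_1}) - \sum_{i=2,\, i \neq j}^{k} (\pi_{\theta_i}^2 - \pi'^2_{\theta_i}) - \left[ (\pi_{\theta_j} - 1)^2 - (\pi'_{\theta_j} - 1)^2 \right] \\
&= -\pi_{\theta_1}^2 < 0.
\end{aligned}$$

Therefore, if $\pi_{\theta_1} < 0$, $s(\theta_j, \pi) < s(\theta_j, \pi')$ for every $j = 1, \ldots, k$.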
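
As a quick numerical check of the two differences just derived, the sketch below uses a hypothetical forecast with a negative first component; the numbers and the helper name `quadratic_score` are assumptions made for illustration only.

```python
def quadratic_score(pi, j):
    """Score (2.3) when event j (0-based) occurs."""
    return -sum((p - (1.0 if i == j else 0.0)) ** 2 for i, p in enumerate(pi))

pi = [-0.2, 0.5, 0.7]        # hypothetical forecast violating condition 1
pi_prime = [0.0] + pi[1:]    # same forecast with the negative entry set to 0

# s(theta_j, pi) - s(theta_j, pi') for each possible outcome j.
diffs = [quadratic_score(pi, j) - quadratic_score(pi_prime, j) for j in range(len(pi))]

# Values predicted by the derivation: -[(pi_1 - 1)^2 - 1] for j = 1, -pi_1^2 otherwise.
predicted = [-((pi[0] - 1.0) ** 2 - 1.0)] + [-(pi[0] ** 2)] * (len(pi) - 1)
print(diffs, predicted)
```

Note that `pi_prime` here still violates condition 2; the construction only repairs condition 1, as in the text.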

We now move to the case where the predictions violate condition 2, that is $0 \leq
\pi_{\theta_i} \leq 1$ for $i = 1, \ldots, k$, but $\sum_{i=1}^{k} \pi_{\theta_i} \neq 1$. Take $\pi'$ to be the orthogonal projection of
$\pi$ on the plane defined to satisfy both conditions 1 and 2. For any $j$ we have

$$s(\theta_j, \pi) - s(\theta_j, \pi') = -\Big[ \sum_{i=1,\, i \neq j}^{k} \pi_{\theta_i}^2 + (\pi_{\theta_j} - 1)^2 \Big] + \Big[ \sum_{i=1,\, i \neq j}^{k} \pi'^2_{\theta_i} + (\pi'_{\theta_j} - 1)^2 \Big].$$

The term

$$\sum_{i=1,\, i \neq j}^{k} \pi_{\theta_i}^2 + (\pi_{\theta_j} - 1)^2$$

corresponds to the squared Euclidean distance between $\pi$ and $e_j$, the $k$-dimensional
canonical point with value 1 in its $j$th coordinate, and 0 elsewhere. Similarly, the term

$$\sum_{i=1,\, i \neq j}^{k} \pi'^2_{\theta_i} + (\pi'_{\theta_j} - 1)^2$$

is the squared Euclidean distance between $\pi'$ and $e_j$. Since $\pi'$ is an orthogonal projection
of $\pi$ and satisfies conditions 1 and 2, it follows that $\|\pi, \pi'\| + \|\pi', e_j\| =
\|\pi, e_j\|$, for any $j = 1, \ldots, k$. Here $\|\pi_1, \pi_2\|$ denotes the squared Euclidean distance
between $\pi_1$ and $\pi_2$. As $\|\pi, \pi'\| \geq 0$, with $\|\pi, \pi'\| = 0$ if, and only if, $\pi = \pi'$,
we conclude that $\|\pi', e_j\| < \|\pi, e_j\|$. Therefore $s(\theta_j, \pi) - s(\theta_j, \pi') < 0$ for any
$j = 1, \ldots, k$.

Problem 2.2 We can extend this result to conditional probabilities by a slight
modification of the scoring rule. We will do this next, following de Finetti (1972,
chapter 2) (English translation of de Finetti 1964). Suppose that a forecaster must
announce a probability $\pi_{\theta_1|\theta_2}$ for an event $\theta_1$ conditional on the occurrence of the
event $\theta_2$, and is scored by the rule $-(\pi_{\theta_1|\theta_2} - \theta_1)^2 \theta_2$. This implies that a penalty of
$-(\pi_{\theta_1|\theta_2} - \theta_1)^2$ occurs if $\theta_2$ is true, but the score is 0 otherwise. Suppose the forecaster
is also required to announce probabilities $\pi_{\theta_1\theta_2}$ and $\pi_{\theta_2}$ for the events $\theta_1\theta_2$ and $\theta_2$
subject to quadratic scoring rules $-(\pi_{\theta_1\theta_2} - \theta_1\theta_2)^2$ and $-(\pi_{\theta_2} - \theta_2)^2$, respectively.
Overall, by announcing $\pi_{\theta_1|\theta_2}$, $\pi_{\theta_1\theta_2}$, and $\pi_{\theta_2}$ the forecaster is subject to the penalty

$$s(\theta_1|\theta_2, \theta_1\theta_2, \theta_2, \pi_{\theta_1|\theta_2}, \pi_{\theta_1\theta_2}, \pi_{\theta_2}) = -(\pi_{\theta_1|\theta_2} - \theta_1)^2 \theta_2 - (\pi_{\theta_1\theta_2} - \theta_1\theta_2)^2 - (\pi_{\theta_2} - \theta_2)^2.$$

Under these conditions $\pi_{\theta_1|\theta_2}$, $\pi_{\theta_1\theta_2}$, and $\pi_{\theta_2}$ must satisfy

$$\pi_{\theta_1\theta_2} = \pi_{\theta_1|\theta_2}\, \pi_{\theta_2}$$

or else the forecaster can again be outscored irrespective of the event that occurs.

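Let $x$, $y$, and $z$ be the values taken by $s$ under the occurrence of $\theta_1\theta_2$, $(1 - \theta_1)\theta_2$,
and $(1 - \theta_2)$, respectively:

$$\begin{aligned}
x &= -(\pi_{\theta_1|\theta_2} - 1)^2 - (\pi_{\theta_1\theta_2} - 1)^2 - (\pi_{\theta_2} - 1)^2 \\
y &= -\pi_{\theta_1|\theta_2}^2 - \pi_{\theta_1\theta_2}^2 - (\pi_{\theta_2} - 1)^2 \\
z &= -\pi_{\theta_1\theta_2}^2 - \pi_{\theta_2}^2.
\end{aligned}$$

To prove this, let $(\pi_{\theta_1|\theta_2}, \pi_{\theta_1\theta_2}, \pi_{\theta_2})$ be a point at which the gradients of $x$, $y$, and $z$
are not in the same plane, meaning that the Jacobian

$$\frac{\partial(x, y, z)}{\partial(\pi_{\theta_1|\theta_2}, \pi_{\theta_1\theta_2}, \pi_{\theta_2})}$$

is not zero. Then, it is possible to improve $x$, $y$, and $z$ simultaneously by moving
$(\pi_{\theta_1|\theta_2}, \pi_{\theta_1\theta_2}, \pi_{\theta_2})$. For a geometrical interpretation of this idea, look at de Finetti
(1972). Therefore, the Jacobian, $8(\pi_{\theta_1|\theta_2}\pi_{\theta_2} - \pi_{\theta_1\theta_2})$, has to be equal to zero, which
implies $\pi_{\theta_1\theta_2} = \pi_{\theta_1|\theta_2}\pi_{\theta_2}$. □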
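
The Jacobian condition can be verified numerically. In the sketch below, $(a, b, c)$ is shorthand introduced here for $(\pi_{\theta_1|\theta_2}, \pi_{\theta_1\theta_2}, \pi_{\theta_2})$, and the gradients are those of $x$, $y$, and $z$ as written above. With the scores carrying a leading minus sign, the determinant works out to $8(b - ac)$, which agrees with the expression in the text up to sign (differentiating the corresponding positive penalties flips the sign). The test points are hypothetical.

```python
def gradients(a, b, c):
    """Gradients of x, y, z with respect to (a, b, c) = (pi_{1|2}, pi_{12}, pi_2)."""
    grad_x = (-2 * (a - 1), -2 * (b - 1), -2 * (c - 1))
    grad_y = (-2 * a, -2 * b, -2 * (c - 1))
    grad_z = (0.0, -2 * b, -2 * c)
    return grad_x, grad_y, grad_z

def det3(m):
    """Determinant of a 3x3 matrix given as three row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def jacobian(a, b, c):
    """Determinant of the Jacobian of (x, y, z) at the announced forecasts."""
    return det3(gradients(a, b, c))

coherent = jacobian(0.5, 0.2, 0.4)     # b equals a * c, so the determinant vanishes
incoherent = jacobian(0.5, 0.3, 0.4)   # b differs from a * c
print(coherent, incoherent)
```

The determinant is zero exactly when $\pi_{\theta_1\theta_2} = \pi_{\theta_1|\theta_2}\pi_{\theta_2}$, whichever sign convention is used.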

Problem 2.3 Next week the Duke soccer team plays American University and
Akron. Call $\theta_1$ the event Duke beats Akron and $\theta_2$ the event Duke beats American.
Suppose a bookmaker is posting the following betting odds and is willing to take
stakes of either sign on any combination of the events:

Event                                         Odds
$\theta_1$                                    4:1
$\theta_2$                                    3:2
$\theta_1 + \theta_2 - \theta_1\theta_2$      9:1
$\theta_1\theta_2$                            2:3

Find stakes that will enable you to win money from the bookmaker no matter what
the results of the games are.

Problem 2.4 (Bayes’ theorem) Let $\theta_1$ and $\theta_2$ be two events. A corollary of Theorem
2.2 is the following. Provided that $\pi_{\theta_2} \neq 0$, a necessary condition for coherence
is that $\pi_{\theta_1|\theta_2} = \pi_{\theta_2|\theta_1}\pi_{\theta_1}/\pi_{\theta_2}$. Prove this corollary without using Theorem 2.2.

Problem 2.5 An insurance company, represented here by you, provides insurance 
policies against hurricane damage. One of your clients is especially concerned about 
tree damage ($\theta_1$) and flood damage ($\theta_2$). In order to state the policies he is asked to
elicit (personal) rates for the following events: $\theta_1$, $\theta_2$, $\theta_1 \mid \theta_2$, and $\theta_2 \mid \theta_1$. Suppose his
probabilities are as follows:

Event                        Rates
$\theta_1$                   0.2
$\theta_2$                   0.3
$\theta_1 \mid \theta_2$     0.8
$\theta_2 \mid \theta_1$     0.9


Show that your client can be made a sure loser.

Utility


Broadly defined, the goal of decision theory is to help choose among actions whose 
consequences cannot be completely anticipated, typically because they depend on 
some future or unknown state of the world. Expected utility theory handles this 
choice by assigning a quantitative utility to each consequence, a probability to each 
state of the world, and then selecting an action that maximizes the expected value 
of the resulting utility. This simple and powerful idea has proven to be a widely 
applicable description of rational behavior. 

In this chapter, we begin our exploration of the relationship between acting rationally
and ranking actions based on their expected utility. For now, probabilities of
unknown states of the world will be fixed. In the chapter about coherence we began to 
examine the relationship between acting rationally and reasoning about uncertainty 
using the laws of probability, but utilities were fixed and relegated to the background. 
So, in this chapter, we will take a complementary perspective. 

The pivotal contribution to this chapter is the work of von Neumann and 
Morgenstern. The earliest version of their theory (here NM theory) is included in 
the appendix of their book Theory of Games and Economic Behavior. The goal of 
the book was the construction of a theory of games that would serve as the foundation
for studying the economic behavior of individuals. But the theory of utility that
they constructed in the process is a formidable contribution in its own right. We will 
examine it here outside the context of game theory. 

Featured articles: 

Bernoulli, D. (1954). Exposition of a new theory on the measurement of risk,
Econometrica 22: 23-36.

Jensen, N. E. (1967). An introduction to Bernoullian utility theory: I Utility
functions, Swedish Journal of Economics 69: 163-183.

The Bernoulli reference is a translation of his 1738 essay, discussing the celebrated
St. Petersburg paradox. Jensen (1967) presents a classic version of the
axiomatic theory of utility originally developed by von Neumann and Morgenstern
(1944). The original is a tersely described development, while Jensen’s account
crystallized the axioms of the theory in the form that was adopted by the foundational
debate that followed. Useful general readings are Kreps (1988) and Fishburn
(1970). 

3.1 St. Petersburg paradox 

The notion that mathematical expectation should guide rational choice under uncertainty
was formulated and discussed as early as the seventeenth century. An issue
of debate was how to find what would now be called a certainty equivalent: that is, 
the fixed amount of money that one is willing to trade against an uncertain prospect, 
as when paying an insurance premium or buying a lottery ticket. Huygens (1657) 
is one of the early authors who used mathematical expectation to evaluate the fair 
price of a lottery. In his time, the prevailing thought was that, in modern terminology,
a rational individual should value a game of chance based on the expected
payoff.

The ideas underlying modern utility theory arose during the Age of Enlightenment,
as the evolution of the initial notion that the fair value is the expected value.
The birthplace of utility theory is usually considered to be St. Petersburg. In 1738, 
Daniel Bernoulli, a Swiss mathematician who held a chair at the local Academy of 
Science, published a very influential paper on decision making (Bernoulli 1738). 
Bernoulli analyzed the behavior of rational individuals in the face of uncertainty 
from a Newtonian perspective, viewing science as an operational model of the human 
mind. His empirical observation, that thoughtful individuals do not necessarily take 
the financial actions that maximize their expected monetary return, led him to investigate
a formal model of individual choices based on the direct quantification of value,
and to develop a prototypical utility function for wealth. A major contribution has 
been to stress that: 

no valid measurement of the value of risk can be given without consideration
of its utility, that is the utility of whatever gain accrues to the
individual.... To make this clear it is perhaps advisable to consider the 
following example. Somehow a very poor fellow obtains a lottery ticket 
that will yield with equal probability either nothing or twenty thousand 
ducats. Would he not be ill advised to sell this lottery ticket for nine thousand
ducats? To me it seems that the answer is in the negative. On the
other hand I am inclined to believe that a rich man would be ill advised 
to refuse to buy the lottery ticket for nine thousand ducats. (Bernoulli 
1954, p. 24, an English translation of Bernoulli 1738).

Bernoulli’s development is best known in connection with the notorious St.
Petersburg game, a gambling game that works, in Bernoulli’s own words, as
follows: 

Perhaps I am mistaken, but I believe that I have solved the extraordinary
problem which you submitted to M. de Montmort, in your letter of
September 9, 1713 (problem 5, page 402). For the sake of simplicity I 
shall assume that A tosses a coin into the air and B commits himself to 
give A 1 ducat if, at the first throw, the coin falls with its cross upward; 2 
if it falls thus only at the second throw, 4 if at the third throw, 8 if at the 
fourth throw, etc. (Bernoulli 1954, p. 33) 

What is the fair price of this game? The expected payoff is

$$1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 4 \times \frac{1}{8} + \cdots,$$

that is infinitely many ducats. So a ticket that costs 20 ducats up front, and yields
the outcome of this game, has a positive expected payoff. But so does a ticket that 
costs a thousand ducats, or any finite amount. If expectation determines the fair 
price, no price is too large. Is it rational to be willing to spend an arbitrarily large 
sum of money to participate in this game? Bernoulli’s view was that it is not. He 
continued: 

The paradox consists in the infinite sum which calculation yields as the
equivalent which A must pay to B. This seems absurd since no reasonable
man would be willing to pay 20 ducats as equivalent. You ask
for an explanation of the discrepancy between mathematical calculation
and the vulgar evaluation. I believe that it results from the fact that, in
their theory, mathematicians evaluate money in proportion to its quantity
while, in practice, people with common sense evaluate money in
proportion to the utility they can obtain from it. (Bernoulli 1954, p. 33)

Using this notion of utility, Bernoulli offered an alternative approach: he suggested
that one should not act based on the expected reward, but on a different kind
of expectation, which he called moral expectation. His starting point was that any
gain brings a utility inversely proportional to the whole wealth of the individual.
Analytically, Bernoulli’s utility $u$ of a gain $\Delta$ in wealth, relative to the current wealth
$z$, can be represented as

$$u(z + \Delta) - u(z) = c\,\frac{\Delta}{z},$$

where $c$ is some positive constant. For $\Delta$ sufficiently small we obtain

$$du(z) = c\,\frac{dz}{z}.$$

Integrating this differential equation gives

$$u(z) = c\,[\log(z) - \log(z_0)], \qquad (3.1)$$

where $z_0$ is the constant of integration and can be interpreted here as the wealth necessary
to get a utility of zero. The worth of the game, said Bernoulli, is the expected
value of the gain in wealth based on (3.1). For example, if the initial wealth is 10 
ducats, the worth is 



Bernoulli computed the value of the game for various values of the initial wealth. If
you own 10 ducats, the game is worth approximately 3 ducats, 6 ducats if you own
1000.
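
These figures can be reproduced under one reading of “worth”: the certain gain $w$ whose utility matches the expected utility of playing, so that $\log(z + w) = \sum_n 2^{-n} \log(z + 2^{n-1})$ for initial wealth $z$. The sketch below rests on that interpretation, which is an assumption, and on truncating the series at 200 terms; the function name is introduced here for illustration.

```python
import math

def bernoulli_worth(wealth, n_terms=200):
    """Certainty equivalent of the St. Petersburg game under logarithmic utility.

    The game pays 2**(n-1) ducats with probability 2**(-n). The constants c and
    z_0 of (3.1) cancel when solving for the certainty equivalent.
    """
    expected_log = sum(0.5 ** n * math.log(wealth + 2 ** (n - 1))
                       for n in range(1, n_terms + 1))
    return math.exp(expected_log) - wealth

worth_10 = bernoulli_worth(10)       # initial wealth of 10 ducats
worth_1000 = bernoulli_worth(1000)   # initial wealth of 1000 ducats
print(round(worth_10, 2), round(worth_1000, 2))
```

Under these assumptions the two values come out close to the 3 and 6 ducats quoted above.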


Bernoulli moved rational behavior away from linear payoffs in wealth, but still 
needed to face the obstacle that the logarithmic utility is unbounded. Savage (1954) 
comments on an exchange between Gabriel Cramer and Daniel Bernoulli’s uncle 
Nicholas Bernoulli on this issue: 

Daniel Bernoulli’s paper reproduces portions of a letter from Gabriel 
Cramer to Nicholas Bernoulli, which establishes Cramer’s chronological
priority to the idea of utility and most of the other main ideas of
Bernoulli’s paper.... Cramer pointed out in his aforementioned letter, 
the logarithm has a serious disadvantage; for, if the logarithm were the 
utility of wealth, the St. Petersburg paradox could be amended to produce 
a random act with an infinite expected utility (i.e., an infinite expected 
logarithm of income) that, again, no one would really prefer to the status 
quo.... Cramer therefore concluded, and I think rightly, that the utility 
of cash must be bounded, at least from above. (Savage 1954, pp. 91-95) 

Bernoulli acknowledged this restriction and commented: 

The mathematical expectation is rendered infinite by the enormous
amount which I can win if the coin does not fall with its cross upward
until rather late, perhaps at the hundredth or thousandth throw. Now, as
a matter of fact, if I reason as a sensible man, this sum is worth no more
to me, causes me no more pleasure and influences me no more to accept
the game than does a sum amounting only to ten or twenty million ducats.
Let us suppose, therefore, that any amount above 10 millions, or (for the
sake of simplicity) above $2^{24} = 16{,}777{,}216$ ducats be deemed by me
equal in value to $2^{24}$ ducats or, better yet, that I can never win more than
that amount, no matter how long it takes before the coin falls with its
cross upward. In this case, my expectation is

$$\begin{aligned}
&\frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 4 + \cdots + \frac{1}{2^{25}}\, 2^{24} + \frac{1}{2^{26}}\, 2^{24} + \frac{1}{2^{27}}\, 2^{24} + \cdots \\
&\quad = \frac{1}{2} + \frac{1}{2} + \cdots \text{(24 times)} \cdots + \frac{1}{2} + \frac{1}{4} + \cdots \\
&\quad = 12 + 1 = 13.
\end{aligned}$$

Thus, my moral expectation is reduced in value to 13 ducats and the
equivalent to be paid for it is similarly reduced - a result which seems
much more reasonable than does rendering it infinite. (Bernoulli 1954,
p. 34)
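
Both computations, the divergent expectation and the capped one, are easy to check with exact arithmetic. The sketch below is illustrative; the function names are introduced here, and the cap of $2^{24}$ ducats is the one used in the quoted passage.

```python
from fractions import Fraction

def partial_expectation(n_throws):
    """Expected payoff from the first n_throws terms of the uncapped game.

    The n-th term pays 2**(n-1) ducats with probability 2**(-n), so each
    term contributes exactly 1/2.
    """
    return sum(Fraction(2 ** (n - 1), 2 ** n) for n in range(1, n_throws + 1))

def capped_expectation(cap_exponent=24):
    """Exact expected payoff when winnings are capped at 2**cap_exponent ducats."""
    cap = 2 ** cap_exponent
    # Throws 1, ..., cap_exponent + 1 pay 2**(n-1) <= cap, contributing 1/2 each.
    head = sum(Fraction(2 ** (n - 1), 2 ** n) for n in range(1, cap_exponent + 2))
    # All later throws pay the cap; their total probability is 2**-(cap_exponent + 1).
    tail = Fraction(cap, 2 ** (cap_exponent + 1))
    return head + tail

print(partial_expectation(40), capped_expectation())
```

The partial sums grow by half a ducat per additional throw without bound, while the capped game has a finite expectation.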

We will return to the issue of bounded utility when discussing Savage’s theory. 
In the rest of this chapter we will focus on problems with a finite number of outcomes,
and we will not worry about unbounded utilities. For the purposes of our
discussion the most important points in Bernoulli’s contribution are: (a) the distinction
between the outcome ensuing from an action (in this case the number of ducats)
and its value; and (b) the notion that rational behavior may be explained and possibly
better guided by quantifying this value. A nice historical account of Bernoulli’s
work, including related work from Laplace to Allais, is given by Jorland (1987). For 
additional comments see also Savage (1954, Sec. 5.6), Berger (1985) and French 
(1988). 

3.2 Expected utility theory and the theory of means 

3.2.1 Utility and means 

The expected utility score attached to an action can be considered as a real-valued 
summary of the worthiness of the outcomes that may result from it. In this sense it 
is a type of mean. From this point of view, expected utility theory is close to theories
concerned with the most appropriate way of computing a mean. This similarity
suggests that, before delving into the details of utility theory, it may be interesting
to explore some of the historical contributions to the theory of means. The discussion
in this section is based on Muliere and Parmigiani (1993), who expand on this
theme.

As we have seen while discussing Bernoulli, the notion that mathematical 
expectation should guide rational choice under uncertainty was formulated and discussed
as early as the seventeenth century. After Bernoulli, the debate on moral
expectation was important throughout the eighteenth century. Laplace dedicated
to it an entire chapter of his celebrated treatise on probability (Laplace 1812).
Interestingly, Laplace emphasized that the appropriateness of moral expectation
relies on individual preferences for relative rather than absolute gains:

[With D. Bernoulli], the concept of moral expectation was no longer a 
substitute but a complement to the concept of mathematical expectation, 
their difference stemming from the distinction between the absolute and 
the relative values of goods. The former being independent of, the latter 
increasing with the needs and desires for these goods. (Laplace 1812, 
pp. 189-190, translation by Jorland 1987) 

It seems fair to say that choosing the appropriate type of expectation of an uncer¬ 
tain monetary payoff was seen by Laplace as the core of what is today identified as 
rational decision making. 

3.2.2 Associative means 

After a long hiatus, this trend reemerged in the late 1920s and the 1930s in sev¬ 
eral parts of the scientific community, including mathematics (functional equations, 
inequalities), actuarial sciences, statistics, economics, philosophy, and probability. 
Not coincidentally, this is also the period when both subjective and axiomatic 
probability theories were born. Bonferroni proposed a unifying formula for the 
calculation of a variety of different means from diverse fields of application. He 
wrote: 

The most important means used in mathematical and statistical applications
consist of determining the number $z$ that relative to the quantities
$z_1, \ldots, z_n$ with weights $w_1, \ldots, w_n$, is in the following relation with
respect to a function $\psi$:

$$\psi(z) = \frac{w_1 \psi(z_1) + \cdots + w_n \psi(z_n)}{w_1 + \cdots + w_n} \qquad (3.2)$$

I will take $z_1, \ldots, z_n$ to be distinct and the weights to be positive.
(Bonferroni 1924, p. 103; our translation and modified notation for
consistency with the remainder of the chapter)

Here $\psi$ is a continuous and strictly monotone function. Various choices of $\psi$
yield commonly used means: $\psi(z) = z$ gives the arithmetic mean, $\psi(z) = 1/z$,
$z > 0$, the harmonic mean, $\psi(z) = z^k$ the power mean (for $k \neq 0$ and $z$ in some real
interval $I$ where $\psi$ is strictly monotone), $\psi(z) = \log z$, $z > 0$, the geometric mean,
$\psi(z) = e^z$ the exponential mean, and so forth.
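
To make the role of $\psi$ concrete, here is a minimal sketch that solves (3.2) for $z$ by applying the inverse of $\psi$ to the weighted average of the $\psi(z_i)$. The function name `associative_mean` and the sample values are assumptions introduced for illustration; the caller is responsible for supplying a matching inverse.

```python
import math

def associative_mean(values, weights, psi, psi_inverse):
    """Solve (3.2) for z: psi(z) equals the weighted average of the psi(z_i)."""
    total_weight = sum(weights)
    average = sum(w * psi(z) for z, w in zip(values, weights)) / total_weight
    return psi_inverse(average)

values = [1.0, 2.0, 4.0]     # hypothetical quantities z_1, ..., z_n
weights = [1.0, 1.0, 1.0]    # equal positive weights

arithmetic = associative_mean(values, weights, lambda z: z, lambda t: t)
harmonic = associative_mean(values, weights, lambda z: 1 / z, lambda t: 1 / t)
geometric = associative_mean(values, weights, math.log, math.exp)
exponential = associative_mean(values, weights, math.exp, math.log)
print(harmonic, geometric, arithmetic, exponential)
```

Reading $\psi$ as a utility function, the returned value is the certainty equivalent of a lottery that yields $z_i$ with probability proportional to $w_i$.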

To illustrate the type of problem behind the development of this formalism, let us
consider one of the standard motivating examples from actuarial sciences, one of the
earliest fields to be concerned with optimal decision making under uncertainty. Consider
the example (Bonferroni 1924) of an insurance company offering life insurance
to a group of $N$ individuals of which $w_1$ are of age $z_1$, $w_2$ are of age $z_2$, ..., and $w_n$ are
of age $z_n$. The company may be interested in determining the mean age $z$ of the group
so that the probability of complete survival of the group after a number $y$ of years is
the same as that of a group of $N$ individuals of equal age $z$. If these individuals share
the same survival function $Q$, and deaths are independent, the mean $z$ satisfies the
relationship

$$\left( \frac{Q(z + y)}{Q(z)} \right)^{N} = \left( \frac{Q(z_1 + y)}{Q(z_1)} \right)^{w_1} \times \cdots \times \left( \frac{Q(z_n + y)}{Q(z_n)} \right)^{w_n}$$

which is of the form (3.2) with $\psi(z) = \log Q(z + y) - \log Q(z)$.

Bonferroni’s work stimulated activity on characterizations of (3.2); that is, 
on finding a set of desirable properties of means that would be satisfied if and 
only if a mean of the form (3.2) is used. Nagumo (1930) and Kolmogorov 
(1930), independently, characterized (3.2) (for $w_i = 1$) in terms of these four
requirements:

1. continuity and strict monotonicity of the mean in the $z_i$;

2. reflexivity (when all the $z_i$ are equal to the same value, that value is the mean);

3. symmetry (that is, invariance to labeling of the $z_i$); and

4. associativity (invariance of the overall mean to the replacement of a subset of 
the values with their partial mean). 
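
The last three requirements can be spot-checked numerically for a particular mean of the form (3.2). The sketch below uses the geometric mean ($\psi = \log$, unit weights) and hypothetical values; it illustrates the properties on one example and is not a proof of the characterization.

```python
import math
from itertools import permutations

def geometric_mean(values):
    """Mean of the form (3.2) with psi = log and all weights equal to 1."""
    return math.exp(sum(math.log(z) for z in values) / len(values))

zs = [2.0, 8.0, 4.0, 1.0]    # hypothetical positive values
overall = geometric_mean(zs)

# Reflexivity: identical values return that same value.
reflexive = math.isclose(geometric_mean([5.0, 5.0, 5.0]), 5.0)

# Symmetry: relabeling the values leaves the mean unchanged.
symmetric = all(math.isclose(geometric_mean(list(p)), overall)
                for p in permutations(zs))

# Associativity: replacing a subset by its partial mean leaves the mean unchanged.
partial = geometric_mean(zs[:2])
associative = math.isclose(geometric_mean([partial, partial] + zs[2:]), overall)
print(reflexive, symmetric, associative)
```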


3.2.3 Functional means 

A complementary approach stemmed from the problem-driven nature of Bonferroni’s
solution. This approach is usually associated with Chisini (1929), who
suggests that one may want to identify a critical aspect of a set of data, and compute
the mean so that, while variability is lost, the critical aspect of interest is maintained.
For example, when computing the mean of two speeds, he would argue,
one can be interested in doing so while keeping fixed the total traveling time, leading
to the harmonic mean, or the total fuel consumption, leading to an expression
depending, for example, on a deterministic relationship between speed and fuel
consumption.
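
The speed example can be sketched as follows, under the assumption (made here for illustration) that each speed is held over a leg of equal length. The speeds are hypothetical and the function names are introduced for this sketch.

```python
def travel_time(speeds, leg_length=1.0):
    """Total time needed to cover legs of equal length at the given speeds."""
    return sum(leg_length / v for v in speeds)

def mean_speed_preserving_time(speeds):
    """Common speed m such that travel_time([m, ..., m]) equals travel_time(speeds)."""
    return len(speeds) / sum(1.0 / v for v in speeds)

speeds = [30.0, 60.0]    # hypothetical speeds on two equal-length legs
m = mean_speed_preserving_time(speeds)
print(m, travel_time([m, m]), travel_time(speeds))
```

The invariant chosen (total traveling time) determines the mean, which here is the harmonic mean of the two speeds rather than their arithmetic mean.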

Chisini’s proposal was formally developed and generalized by de Finetti 
(1931a), who later termed it functional. De Finetti regarded this approach as the 
appropriate way for a subject to determine the certainty equivalent of a distribution 
function. In this framework, he reinterpreted the axioms of Nagumo and Kolmogorov 
as natural requirements for such choice. He also extended the characterization 
theorem to more general spaces of distribution functions. 

In current decision-theoretic terms, determining the certainty equivalent of a distribution
of uncertain gains according to (3.2) is formally equivalent to computing
an expected utility score where $\psi$ plays the role of the utility function. De Finetti
commented on this after the early developments of utility theory (de Finetti 1952).
In current terms, his point of view is that the Nagumo-Kolmogorov characterization
of means of the form (3.2) amounts to translating the expected utility principle into
more basic axioms about the comparison of probability distributions.

It is interesting to compare this later line of thinking to the treatment of subjective
probability that de Finetti was developing in the early 1930s (de Finetti
1931b, de Finetti 1937), and that we discussed in Chapter 2. There, subjective
probability is derived based on an agent’s fair betting odds for events. Determining
the fair betting odds for an event amounts to declaring a fixed price at
which the agent is willing to buy or sell a ticket giving a gain of $S$ if the event
occurs and a gain of 0 otherwise. Again, the fundamental notion is that of certainty
equivalent. However, in the problem of means, the probability distribution
is fixed, and the existence of a well-behaved $\psi$ (that can be thought of as a utility)
is derived from the axioms. In the foundation of subjective probability, the
utility function for money is fixed at the outset (it is actually linear), and the
fact that fair betting odds behave like probabilities is derived from the coherence
requirement.


3.3 The expected utility principle 

We are now ready to formalize the problem of choosing among actions whose 
consequences are not completely known. We start by defining a set of outcomes 
(also called consequences, and sometimes rewards), which we call Z, with generic 
outcome z. Outcomes are potentially complex and detailed descriptions of all the 
circumstances that may be relevant in the decision problem at hand. Examples of 
outcomes include schedules of revenues over time, consequences of correctly or 
incorrectly rejecting a scientific hypothesis, health states following a treatment or 
an intervention, consequences of marketing a drug, change in the exposure to a toxic 
agent that may result from a regulatory change, knowledge gained from a study, and 
so forth. Throughout the chapter we will assume that only finitely many different 
outcomes need to be considered. 

The consequences of an action depend on the unknown state of the world: an
action yields a given outcome in $Z$ corresponding to each state of the world $\theta$ in
some set $\Theta$. We will use this correspondence as the defining feature of an action; that
is, an action will be defined as a function $a$ from $\Theta$ to $Z$. The set of all actions will
be denoted here by $A$. A simple action is illustrated in Table 3.1.

In expected utility theory, the basis for choosing among actions is a quantitative
evaluation of both the utility of the outcomes and the probability that each outcome
will occur. Utilities of consequences are measured by a real-valued function $u$ on $Z$,
while probabilities of states of the world are represented by a probability distribution
$\pi$ on $\Theta$. The weighted average of the utility with respect to the probability is then
used as the choice criterion.

Throughout this chapter, the focus is on the utility aspect of the expected utility
principle. Probabilities are fixed; they are regarded as the description of a
well-understood chance mechanism.

Table 3.1 A simple decision problem with three possible
actions, four possible outcomes, expressed as cash flows in
pounds, and three possible states of the world. You can revisit
it in Problem 3.3.

                     States of nature
Actions     $\theta = 1$     $\theta = 2$     $\theta = 3$
$a_1$       £100             £110             £120
$a_2$       £90              £100             £120
$a_3$       £100             £110             £100


An alternative way of describing an action in the NM theory is to think of it as a
probability distribution over the set $Z$ of possible outcomes. For example, if action
$a$ is taken, the probability of outcome $z$ is

$$p(z) = \int_{\theta:\, a(\theta) = z} \pi(\theta)\, d\theta, \qquad (3.3)$$

that is the probability of the set of states of the world for which the outcome is $z$ if $a$ is
chosen. If $\Theta$ is a subset of the real line, say $(0, 1)$, and $\pi$ a continuous distribution, by
varying $a(\theta)$ we can generate any distribution over $Z$, as there are only finitely many
elements in it. We could, for example, assume without losing generality that $\pi$ is
uniform, and that the “well-understood chance mechanism” is a random draw from
$(0, 1)$. We will, however, maintain the notation based on a generic $\pi$, to facilitate
comparison with Savage’s theory, discussed in Chapter 5.

In this setting the expected utility principle consists of choosing the action a that 
maximizes the expected value 


\[U_\pi(a) = \int_\Theta u(a(\theta))\, \pi(\theta)\, d\theta \tag{3.4}\]

of the utility. Equivalently, in terms of the outcome probabilities,

\[U_\pi(a) = \sum_{z \in Z} p(z)\, u(z). \tag{3.5}\]



An optimal action a*, sometimes called a Bayes action, is one that maximizes the 
expected utility, that is 


\[a^* = \arg\max_{a} U_\pi(a). \tag{3.6}\]

Utility theory deals with the foundations of this quantitative representation of 
individual choices. 
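
As a small numerical illustration of expressions (3.3) to (3.6), the sketch below evaluates the three actions of Table 3.1. The state probabilities and the logarithmic utility are assumptions made only for this example; they are not part of the table.

```python
import math

# Payoffs from Table 3.1 (pounds), one entry per state theta = 1, 2, 3.
payoffs = {
    "a1": [100, 110, 120],
    "a2": [90, 100, 120],
    "a3": [100, 110, 100],
}

# Assumptions for illustration only: state probabilities and a log utility.
pi = [0.25, 0.5, 0.25]
u = math.log


def expected_utility_by_state(action):
    """Discrete analogue of (3.4): sum over states of u(a(theta)) pi(theta)."""
    return sum(u(z) * prob for z, prob in zip(payoffs[action], pi))


def lottery(action):
    """Discrete analogue of (3.3): probability of each outcome z under the action."""
    p = {}
    for z, prob in zip(payoffs[action], pi):
        p[z] = p.get(z, 0.0) + prob
    return p


def expected_utility_by_outcome(action):
    """Expression (3.5): sum over outcomes of p(z) u(z)."""
    return sum(prob * u(z) for z, prob in lottery(action).items())


# Expression (3.6): the action with the largest expected utility.
best = max(payoffs, key=expected_utility_by_state)

for name in payoffs:
    print(name, lottery(name), round(expected_utility_by_state(name), 4))
print("optimal action:", best)
```

Because the first action pays at least as much as the other two in every state, it comes out on top for any increasing utility, not just the logarithm assumed here.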

3.4 The von Neumann-Morgenstern representation theorem

3.4.1 Axioms 

Von Neumann & Morgenstern (1944) showed how the expected utility principle captured
by (3.4) can be derived from conditions on ordinal relationships among all
actions. In particular, they provided necessary and sufficient conditions for preferences
over a set of actions to be representable by a utility function of the form (3.4).
These conditions are often thought of as basic rationality requirements, an interpretation
which would equate acting rationally to making choices based on expected
utility. We examine them in detail here.

The set of possible outcomes Z has n elements \(\{z_1, z_2, \ldots, z_n\}\). As we have seen in
Section 3.3, an action a implies a probability distribution over Z, which is also called
a lottery, or gamble. We will also use the notation \(p_i\) for \(p(z_i)\), and \(p = (p_1, \ldots, p_n)\).
For example, if \(Z = \{z_1, z_2, z_3\}\), the lotteries corresponding to actions a and a',
depicted in Figure 3.1, can be equivalently denoted as \(p = (1/2, 1/4, 1/4)\) and
\(p' = (0, 0, 1)\).

The space of possible actions A is the set of all functions from \(\Theta\) to Z. A lottery
does not identify uniquely an element of A, but all functions that lead to the same
probability distribution can be considered equivalent for the purpose of the NM theory
(we will meet an exception in Chapter 6 where we consider state-dependent
utilities).

As in Chapter 2, the notation \(\prec\) is used to indicate a binary preference relation
on A. The notation \(a \prec a'\) indicates that action a' is strictly preferred to action a.





Figure 3.1 Two lotteries associated with a and a', and the lottery associated with
compound action a'' with \(\alpha = 1/3\).

Indifference between two actions (neither \(a \prec a'\) nor \(a' \prec a\)) is indicated by
\(a \sim a'\). The notation \(a \preceq a'\) indicates that a' is preferred or indifferent to a. The
relation \(\prec\) over the action space A is called a preference relation if it satisfies these
two properties:

Completeness: for any two a and a' in A, one and only one of these three
relationships must be satisfied:

    \(a \prec a'\), or
    \(a \succ a'\), or
    neither of the above.

Transitivity: for any a, a', and a'' in A, the two relationships \(a \preceq a'\) and \(a' \preceq a''\)
together imply that \(a \preceq a''\).

Completeness says that, in comparing two actions, “I don’t know” is not allowed
as an answer. Considering how big A is, this may seem like a completely unreasonable
requirement, but the NM theory will not really require the decision maker to
actually go through and make all possible pairwise comparisons: it just requires that
the decision maker will find the axioms plausible no matter which pairwise comparison
is being considered. The “and only one” part of the definition builds in a property
sometimes called asymmetry.

Transitivity is a very important assumption, and is likely to break down in problems
where outcomes have multiple dimensions. Think of a sports analogy. “Team
A is likely to beat team B” is a binary relation between teams. It could be that team
A beats team B most of the time, that team B beats team C most of the time, and yet
that team C beats team A more often than not. Maybe team C has offensive strategies
that are more effective against A than B, or has a player that can erase A’s star player
from the game. The point is that teams are complex and multidimensional and this
can generate cyclical, or nontransitive, binary relations. Transitivity will go a long
way towards turning our problem into a one-dimensional one and paving the way for
a real-valued utility function.
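
The contrast between a cyclical relation and one induced by a one-dimensional score can be checked mechanically. The sketch below is only an illustration: the head-to-head results and the ratings are made up, and `is_transitive` is a helper introduced here, not notation from the text.

```python
from itertools import permutations


def is_transitive(items, prefers):
    """Return True if x > y and y > z always imply x > z."""
    return all(
        prefers(x, z)
        for x, y, z in permutations(items, 3)
        if prefers(x, y) and prefers(y, z)
    )


teams = ["A", "B", "C"]

# Hypothetical head-to-head results: A usually beats B, B usually beats C,
# and C usually beats A.
usually_beats = {("A", "B"), ("B", "C"), ("C", "A")}


def cyclic(x, y):
    return (x, y) in usually_beats


# Hypothetical one-dimensional ratings; comparing ratings is transitive.
rating = {"A": 3, "B": 2, "C": 1}


def by_rating(x, y):
    return rating[x] > rating[y]


print(is_transitive(teams, cyclic))     # False
print(is_transitive(teams, by_rating))  # True
```

Any relation defined by comparing real-valued scores passes this check, which is why transitivity is a prerequisite for a real-valued utility representation.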

A critical notion in what follows is that of a compound action. For any two actions
a and a', and for \(\alpha \in [0,1]\), a compound action \(a'' = \alpha a + (1-\alpha)a'\) denotes the
action that assigns probability \(\alpha p(z) + (1-\alpha)p'(z)\) to outcome z. For example, in
Figure 3.1, \(a'' = (1/3)a + (2/3)a'\), which implies that \(p'' = (1/6, 1/12, 9/12)\). The
notation \(\alpha a + (1-\alpha)a'\) is a shorthand for pointwise averaging of the probability
distributions implied by actions a and a', and does not indicate that the outcomes
themselves are averaged. In fact, in most cases the summation operator will not be
defined on Z.
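
The compound lottery of Figure 3.1 can be reproduced with exact arithmetic. This is a minimal sketch; the helper name `mix` is ours.

```python
from fractions import Fraction as F


def mix(alpha, p, p_prime):
    """Pointwise average alpha * p + (1 - alpha) * p' of two lotteries."""
    return tuple(alpha * x + (1 - alpha) * y for x, y in zip(p, p_prime))


p = (F(1, 2), F(1, 4), F(1, 4))   # lottery implied by a in Figure 3.1
p_prime = (F(0), F(0), F(1))      # lottery implied by a'

p_compound = mix(F(1, 3), p, p_prime)
print(p_compound)  # probabilities 1/6, 1/12, and 3/4 (that is, 9/12)
```

Only the probabilities are averaged; the outcomes themselves never enter the computation, which is why no arithmetic needs to be defined on Z.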

The axioms of the NM utility theory, in the format given by Jensen (1967) (see 
also Fishburn 1981, Fishburn 1982, Kreps 1988), are 

NM1 \(\prec\) is complete and transitive.

NM2 Independence: for every a, a', and a'' in A and \(\alpha \in (0,1]\), we have
\(a \succ a'\) implies \((1-\alpha)a'' + \alpha a \succ (1-\alpha)a'' + \alpha a'\).

NM3 Archimedean: for every a, a', and a'' in A such that \(a \succ a' \succ a''\) we can find
\(\alpha, \beta \in (0,1)\) such that

\[\alpha a + (1-\alpha)a'' \succ a' \succ \beta a + (1-\beta)a''.\]


The independence axiom requires that two composite actions should be compared
solely based on components that may be different. In economics this is a
controversial axiom from both the normative and the descriptive viewpoints. We will
return later on to some of the implications of this axiom and the criticisms it has
drawn. An important argument in favor of this axiom as a normative axiom in statistical
decision theory is put forth by Seidenfeld (1988), who argues that violating this
axiom amounts to a sequential form of incoherence.

The Archimedean axiom requires that, when a is preferred to a', it is not preferred
so strongly that mixing a with a'' cannot lead to a reversal of preference. So a
cannot be incommensurably better than a'. Likewise, a'' cannot be incommensurably
worse than a'. A simple example of a violation of this axiom is lexicographic preferences:
that is, preferences that use some dimensions of the outcomes as a primary
way of establishing preference, and use other dimensions only as tie breakers if the
previous dimensions are not sufficient to establish a preference. Look at the worked
Problem 3.1 for more details.

In spite of the different context and structural assumptions, the axioms of the
Nagumo-Kolmogorov characterization and the axioms of von Neumann and Morgenstern
offer striking similarities. For example, there is a parallel between the
associativity condition (substituting observations with their partial mean has no effect
on the global mean) and the independence condition (mixing with the same weight
an option to two other options will not change the preferences).

3.4.2 Representation of preferences via expected utility 

Axioms NM1, NM2 and NM3 hold if and only if there is a real valued utility function 
u such that the preferences for the options in A can be represented as in expression 
(3.4). A given set of preferences identifies a utility function u only up to a linear 
transformation with positive slope. This result is formalized in the von Neumann- 
Morgenstern representation theorem below. 

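Theorem 3.1 Axioms NM1, NM2, and NM3 are true if and only if there exists a
function u such that for every pair of actions a and a'

\[a \succ a' \iff \int_\Theta u(a(\theta))\, \pi(\theta)\, d\theta > \int_\Theta u(a'(\theta))\, \pi(\theta)\, d\theta \tag{3.7}\]

and u is unique up to a positive linear transformation.

An equivalent representation of expression (3.7) is

\[a \succ a' \iff \sum_{z \in Z} p(z)\, u(z) > \sum_{z \in Z} p'(z)\, u(z). \tag{3.8}\]
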
Understanding the proof of the von Neumann-Morgenstern representation theorem
is a bit of a time investment but it is generally worthwhile, as it will help in
understanding the implications and the role of the axioms as well as the meaning
of utility and its relation to probability. It will also lead us to an intuitive elicitation
approach for individual utilities.

To prove the theorem we will start out with a lemma, the proof of which is left 
as an exercise. 


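Lemma 1 If the binary relation \(\succ\) satisfies Axioms NM1, NM2, and NM3, then:

(a) If \(a \succ a'\) then

\[\beta a + (1-\beta)a' \succ \alpha a + (1-\alpha)a' \quad \text{if and only if} \quad 0 \le \alpha < \beta \le 1.\]

(b) \(a \succeq a' \succeq a''\), \(a \succ a''\) imply that there is a unique \(\alpha^* \in [0,1]\) such that
\(a' \sim \alpha^* a + (1-\alpha^*)a''\).

(c) \(a \sim a'\), \(\alpha \in [0,1]\) imply that

\[\alpha a + (1-\alpha)a'' \sim \alpha a' + (1-\alpha)a'', \quad \forall\, a'' \in A.\]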

Taking this lemma for granted we will proceed to prove the von Neumann- 
Morgenstern theorem. Parts (a) and (c) of the lemma are not too surprising. The 
heavyweight is (b). Most of the weight in proving (b) is carried by the Archimedean 
axiom. If you want to understand more look at Problem 3.6. 

We will prove the theorem in the \(\Rightarrow\) direction. The reverse is easier and is left
as an exercise. There are two important steps in the proof of the main theorem. First
we will show that the preferences can be represented by some real-valued function
\(\varphi\): that is, that there is a \(\varphi\) such that

\[a \succ a' \iff \varphi(a) > \varphi(a').\]

This means that the problem can be captured by a one-dimensional “score” representing
the worthiness of an action. Considering how complicated an action can be,
this is no small feat. The second step is to prove that this \(\varphi\) must be of the form of an
expected utility.

Let us start by defining \(\chi_z\) as the action that gives outcome z for every \(\theta\). The
implied p assigns probability 1 to z (see Figure 3.2).

A fact that we will state without detailed proof is that, if preferences satisfy
Axioms NM1, NM2, and NM3, then there exist \(z_0\) (worst outcome) and \(z^0\) (best
outcome) in Z such that

\[\chi_{z^0} \succeq a \succeq \chi_{z_0}\]

for every \(a \in A\). One way to prove this is by induction on the number of outcomes.
To get a flavor of the argument, try the case n = 2.

Figure 3.2 The lottery implied by \(\chi_{z_2}\), which gives outcome \(z_2\) with certainty.

We are going to rule out the trivial case where the decision maker is indifferent
to everything, so that we can assume that \(z^0\) and \(z_0\) are also such that \(\chi_{z^0} \succ \chi_{z_0}\).
Lemma 1, part (b) guarantees that there exists a unique \(\alpha \in [0,1]\) such that

\[a \sim \alpha \chi_{z^0} + (1-\alpha)\chi_{z_0}.\]

We define \(\varphi(a)\) as the value of \(\alpha\) that satisfies the condition above. Consider now
another action \(a' \in A\). We have

\[a \sim \varphi(a)\chi_{z^0} + (1-\varphi(a))\chi_{z_0} \tag{3.9}\]
\[a' \sim \varphi(a')\chi_{z^0} + (1-\varphi(a'))\chi_{z_0}; \tag{3.10}\]

from Lemma 1, part (a), \(\varphi(a) > \varphi(a')\) if and only if

\[\varphi(a)\chi_{z^0} + (1-\varphi(a))\chi_{z_0} \succ \varphi(a')\chi_{z^0} + (1-\varphi(a'))\chi_{z_0},\]

which in turn holds if and only if

\[a \succ a'.\]

So we have proved that \(\varphi(\cdot)\), the weight at which we are indifferent between the
given action and a combination of the best and worst with that weight, provides a
numerical representation of the preferences.

We now move to proving that \(\varphi(\cdot)\) takes the form of an expected utility. Now,
from applying Lemma 1, part (c) to the indifference relations (3.9) and (3.10)
we get

\[\alpha a + (1-\alpha)a' \sim \alpha[\varphi(a)\chi_{z^0} + (1-\varphi(a))\chi_{z_0}] + (1-\alpha)[\varphi(a')\chi_{z^0} + (1-\varphi(a'))\chi_{z_0}].\]

We can reorganize this expression and obtain

\[\alpha a + (1-\alpha)a' \sim [\alpha\varphi(a) + (1-\alpha)\varphi(a')]\chi_{z^0} + [\alpha(1-\varphi(a)) + (1-\alpha)(1-\varphi(a'))]\chi_{z_0}. \tag{3.11}\]

By definition \(\varphi(a)\) is such that

\[a \sim \varphi(a)\chi_{z^0} + (1-\varphi(a))\chi_{z_0}.\]

Therefore \(\varphi(\alpha a + (1-\alpha)a')\) is such that

\[\alpha a + (1-\alpha)a' \sim \varphi(\alpha a + (1-\alpha)a')\chi_{z^0} + (1 - \varphi(\alpha a + (1-\alpha)a'))\chi_{z_0}.\]

Combining the expression above with (3.11) we can conclude that

\[\varphi(\alpha a + (1-\alpha)a') = \alpha\varphi(a) + (1-\alpha)\varphi(a'),\]

proving that \(\varphi(\cdot)\) is an affine function.

We now define \(u(z) = \varphi(\chi_z)\). Again, by definition of \(\varphi\),

\[\chi_z \sim u(z)\chi_{z^0} + (1-u(z))\chi_{z_0}.\]

Eventually we would like to show that \(\varphi(a) = \sum_{z \in Z} u(z)p(z)\), where the p are the
probabilities implied by a. We are not far from it. Consider \(Z = \{z_1, \ldots, z_n\}\). The
proof is by induction on n.

(i) If \(Z = \{z_1\}\) take \(a = \chi_{z_1}\). Then, \(\chi_{z^0}\) and \(\chi_{z_0}\) are equal to \(\chi_{z_1}\) and therefore,
\(\varphi(a) = u(z_1)p(z_1)\).

(ii) Now assume that the representation holds for actions giving positive mass
to \(Z = \{z_1, \ldots, z_{m-1}\}\) only and consider the larger outcome space \(Z =
\{z_1, \ldots, z_{m-1}, z_m\}\). Say \(z_m\) is such that \(p(z_m) > 0\); that is, \(z_m\) is in the support of
p. (If \(p(z_m) = 0\) we are covered by the previous case.) We define

\[p'(z) = \begin{cases} 0 & \text{if } z = z_m \\ \dfrac{p(z)}{1 - p(z_m)} & \text{if } z \neq z_m. \end{cases}\]

Here, p' has support of size (m - 1). We can reexpress the definition above as

\[p(z) = p(z_m)\, I_{\{z = z_m\}} + (1 - p(z_m))\, p'(z),\]

where I is the indicator function, so that

\[a \sim p(z_m)\chi_{z_m} + (1 - p(z_m))a'.\]

Applying \(\varphi\) on both sides we obtain

\[\varphi(a) = p(z_m)\varphi(\chi_{z_m}) + (1 - p(z_m))\varphi(a') = p(z_m)u(z_m) + (1 - p(z_m))\varphi(a').\]

We can now apply the induction hypothesis on a', since the support of p' is of
size (m - 1). Therefore,

\[\varphi(a) = p(z_m)u(z_m) + (1 - p(z_m)) \sum_{z \neq z_m} u(z)p'(z).\]

Using the definition of p' it follows that

\[\varphi(a) = p(z_m)u(z_m) + \sum_{z \neq z_m} p(z)u(z) = \sum_{k=1}^{m} p(z_k)u(z_k),\]

which completes the proof. □
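
The induction step can be mimicked numerically: peel one outcome off the support, renormalize the rest, and recurse. The sketch below is only a sanity check under assumed inputs; the utilities and probabilities are hypothetical, and the function names are ours.

```python
from fractions import Fraction as F


def phi_direct(p, u):
    """Expected utility: sum over outcomes of p(z) u(z)."""
    return sum(pz * u[z] for z, pz in p.items())


def phi_recursive(p, u):
    """Peel off one outcome at a time, as in step (ii) of the proof."""
    support = [z for z, pz in p.items() if pz > 0]
    if len(support) == 1:             # degenerate lottery chi_z
        return u[support[0]]
    z_m = support[-1]
    p_m = p[z_m]
    # Renormalized lottery p' on the remaining outcomes.
    p_prime = {z: p[z] / (1 - p_m) for z in support[:-1]}
    return p_m * u[z_m] + (1 - p_m) * phi_recursive(p_prime, u)


# Hypothetical utilities on a scale where the worst outcome has utility 0
# and the best has utility 1.
u = {"z1": F(0), "z2": F(2, 5), "z3": F(7, 10), "z4": F(1)}
p = {"z1": F(1, 10), "z2": F(3, 10), "z3": F(2, 5), "z4": F(1, 5)}

print(phi_direct(p, u), phi_recursive(p, u))
```

Both routes give the same number because each recursive call uses exactly the affine property established for the representation just before the induction.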

3.5 Allais’ criticism 

Because of the centrality of the expected utility paradigm, the von Neumann-Morgenstern
axiomatization and its derivatives have been deeply scrutinized and
criticized from both descriptive and normative perspectives. Empirically, it is well
documented that individuals may willingly violate the independence axiom (Allais
1953, Kreps 1988, Kahneman et al. 1982, Shoemaker 1982). Normative questions
have also been raised about the weak ordering assumption (Seidenfeld 1988,
Seidenfeld et al. 1995).

In a landmark paper, published in 1953, Maurice Allais proposed an example that 
challenged both the normative and descriptive validities of expected utility theory. 
The example is based on comparing the hypothetical decision situations depicted in 
Figure 3.3. In one situation, the decision maker is asked to choose between lotteries 
a and a'. In the next, the decision maker is asked to choose between lotteries b and 
b'. Consider these two choices before you continue reading. What would you do?

In Savage’s words: 

Many people prefer gamble a to gamble a', because, speaking qualitatively,
they do not find the chance of winning a very large fortune in
place of receiving a large fortune outright adequate compensation for
even a small risk of being left in the status quo. Many of the same people
prefer gamble b' to gamble b; because, speaking qualitatively, the chance
of winning is nearly the same in both gambles, so the one with the much
larger prize seems preferable. But the intuitively acceptable pair of preferences,
gamble a preferred to gamble a' and gamble b' to gamble b,
is not compatible with the utility concept. (Savage 1954, p. 102, with a
change of notation)

Why is it that these preferences are not compatible with the utility concept? The
pair of preferences implies that any utility function must satisfy

\[u(5) > 0.1\,u(25) + 0.89\,u(5) + 0.01\,u(0)\]
\[0.1\,u(25) + 0.9\,u(0) > 0.11\,u(5) + 0.89\,u(0).\]

Here arguments are in units of $100,000. You can easily check that one cannot have
both inequalities at once: the first simplifies to \(0.11\,u(5) > 0.1\,u(25) + 0.01\,u(0)\), and
the second to the reverse strict inequality. So there is no utility function that is consistent with gamble
a being preferred to gamble a' and gamble b' to gamble b.

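
The incompatibility can also be checked by brute force. In the sketch below, `allais_margins` (a name introduced here) returns the amount by which each stated preference holds under a candidate utility; the randomly drawn utility values are arbitrary test inputs.

```python
from fractions import Fraction as F
import random


def allais_margins(u0, u5, u25):
    """Margins by which each stated preference holds under utility u."""
    # a preferred to a': u(5) > 0.10 u(25) + 0.89 u(5) + 0.01 u(0)
    m1 = u5 - (F(10, 100) * u25 + F(89, 100) * u5 + F(1, 100) * u0)
    # b' preferred to b: 0.10 u(25) + 0.90 u(0) > 0.11 u(5) + 0.89 u(0)
    m2 = (F(10, 100) * u25 + F(90, 100) * u0) - (F(11, 100) * u5 + F(89, 100) * u0)
    return m1, m2


random.seed(0)
for _ in range(1000):
    # Arbitrary increasing utility values for $0, $500,000, and $2,500,000.
    u0, u5, u25 = sorted(F(random.randint(0, 10**6), 10**6) for _ in range(3))
    m1, m2 = allais_margins(u0, u5, u25)
    assert m1 + m2 == 0                  # the two margins always cancel
    assert not (m1 > 0 and m2 > 0)       # so they cannot both be positive

print("no sampled utility satisfies both inequalities")
```

The two margins sum to zero identically, so the random search is just a concrete view of the algebra: whenever one preference holds strictly, the other fails.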
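[Figure 3.3: decision trees for the four lotteries]

a :  $500,000 with probability 1.00
a':  $2,500,000 with probability 0.10; $500,000 with probability 0.89; $0 with probability 0.01
b :  $500,000 with probability 0.11; $0 with probability 0.89
b':  $2,500,000 with probability 0.10; $0 with probability 0.90
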
Figure 3.3 The lotteries involved in the so-called Allais paradox. Decision makers 
often prefer a to a' and b' to b—a violation of the independence axiom in the NM 
theory. 

Specifically, which elements of the expected utility theory are being criticized 
here? Lotteries a and a' can be viewed as compound lotteries involving another 
lottery a*, which is depicted in Figure 3.4. In particular, 


\[a = 0.11\,a + 0.89\,a\]
\[a' = 0.11\,a^* + 0.89\,a.\]

By the independence axiom (Axiom NM2 in the NM theory) we should have

\[a \succ a' \quad \text{if and only if} \quad a \succ a^*.\]

Observe, on the other hand, that

\[b = 0.89\,\chi_0 + 0.11\,\chi_5 = 0.89\,\chi_0 + 0.11\,a \succ 0.89\,\chi_0 + 0.11\,a^* = 0.90\,\chi_0 + 0.10\,\chi_{25} = b'.\]



[Figure 3.4: decision tree for lottery a*, with branches leading to $0 and $2,500,000]

Figure 3.4 Allais Paradox: Lottery a*.

If we obey the independence axiom, we should also have \(b \succ b'\). Therefore,
preferences for a over a', and b' over b, violate the independence axiom.
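
The four decompositions used in this argument can be verified with exact fractions. One assumption is made explicit in the sketch: the probabilities of a* are not printed in the text, so they are taken here to be the values forced by the identity a' = 0.11a* + 0.89a, namely 10/11 on $2,500,000 and 1/11 on $0.

```python
from fractions import Fraction as F


def mix(alpha, p, q):
    """Lottery alpha * p + (1 - alpha) * q over outcomes ($0, $500,000, $2,500,000)."""
    return tuple(alpha * x + (1 - alpha) * y for x, y in zip(p, q))


# Probabilities of ($0, $500,000, $2,500,000) for each lottery.
chi_0 = (F(1), F(0), F(0))
a = (F(0), F(1), F(0))
a_prime = (F(1, 100), F(89, 100), F(10, 100))
b = (F(89, 100), F(11, 100), F(0))
b_prime = (F(90, 100), F(0), F(10, 100))

# Assumed probabilities for a*, pinned down by a' = 0.11 a* + 0.89 a.
a_star = (F(1, 11), F(0), F(10, 11))

w = F(11, 100)
assert mix(w, a, a) == a
assert mix(w, a_star, a) == a_prime
assert mix(w, a, chi_0) == b
assert mix(w, a_star, chi_0) == b_prime
print("all four decompositions check out")
```

In both pairs the lotteries differ only in whether the 0.11 component is a or a*, which is exactly the situation the independence axiom speaks to.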


3.6 Extensions 

Expected utility maximization has been a fundamental tool in guiding practical
decision making under uncertainty. The excellent publications by Fishburn
(1970), Fishburn (1989) and Kreps (1988) provide insight, details, and references.
The literature on the extensions of the NM theory is also extensive. Good
entry points are Fishburn (1982) and Gärdenfors and Sahlin (1988). A general
discussion of problems involved in various extensions can be found in Kreps
(1988).

Although in our discussion of the NM theory we used a finite set of outcomes Z,
similar results can be obtained when Z is infinite. For example, Fishburn (1970,
chapter 10) provides a generalization to continuous spaces by requiring additional
technical conditions, which effectively allow one to define a utility function that
is measurable, along with a new dominance axiom. This axiom imposes that if a is preferred
to every outcome in a set to which p' assigns probability 1, then a' should not be
preferred to a, and vice versa. A remaining limitation of the results discussed by
Fishburn (1970) is the fact that the utility function is assumed to be bounded. Kreps
(1988) observes that the utility function does not need to be bounded, but only such
that plus and minus infinity are not possible expected utilities under the considered
class of probability distributions. He also suggests an alternative version for
the Archimedean Axiom NM3 that, along with the other axioms, provides a generalization
of the von Neumann-Morgenstern representation theorem in which the utility
is continuous over a real-valued outcome space.


3.7 Exercises 

Problem 3.1 (Kreps 1988) Planners in the war room of the state of Freedonia can
express the quality of any war strategy against arch-rival Sylvania by a probability
distribution on the three outcomes: Freedonia wins; draws; loses (\(z_1\), \(z_2\), and \(z_3\)).
Rufus T. Firefly, Prime Minister of Freedonia, expresses his preferences over such
probability distributions by the lexicographic preferences

\[a \succ a' \quad \text{whenever} \quad p_3 < p_3' \ \text{ or } \ [\,p_3 = p_3' \text{ and } p_2 < p_2'\,],\]

where \(p = (p_1, p_2, p_3)\) is the lottery associated with action a and \(p' = (p_1', p_2', p_3')\) the
lottery associated with action a'.

Which of the three axioms does this binary relation satisfy (if any) and which 
does it violate? 
Solution

NM1 First, let us check whether the preference relation is complete.

(i) If \(p_3 < p_3'\) then \(a \succ a'\) (by the definition of the binary relation).

(ii) If \(p_3 > p_3'\) then \(a' \succ a\).

(iii) If \(p_3 = p_3'\) then

    (iii.1) if \(p_2 < p_2'\) then \(a \succ a'\),

    (iii.2) if \(p_2 > p_2'\) then \(a' \succ a\),

    (iii.3) if \(p_2 = p_2'\) then it must also be that \(p_1 = p_1'\) and the two actions are
    the same.

Therefore, the binary relation is complete. Let us now check whether it is
transitive, by considering actions such that \(a' \succ a\) and \(a'' \succ a'\).

(i) Suppose \(p_3' < p_3\) and \(p_3'' < p_3'\). Then, clearly, \(p_3'' < p_3\) and therefore \(a'' \succ a\).

(ii) Suppose \(p_3' < p_3\), \(p_3' = p_3''\), and \(p_2'' < p_2'\). We have \(p_3'' < p_3\) and therefore
\(a'' \succ a\).

(iii) Suppose \(p_3' = p_3\), \(p_2' < p_2\), and \(p_3'' < p_3'\). It follows that \(p_3'' < p_3\). Thus,
\(a'' \succ a\).

(iv) Suppose \(p_3' = p_3\), \(p_2' < p_2\), and \(p_3'' = p_3'\) and \(p_2'' < p_2'\). We now have \(p_3 = p_3''\)
and \(p_2'' < p_2\). Therefore, \(a'' \succ a\).

From (i)-(iv) we can conclude that the binary relation is also transitive. So Axiom
NM1 is not violated.

NM2 Take \(\alpha \in (0,1]\):

(i) Suppose \(a \succ a'\) due to \(p_3 < p_3'\). For any value of \(\alpha \in (0,1]\) we have
\(\alpha p_3 < \alpha p_3' \Rightarrow \alpha p_3 + (1-\alpha)p_3'' < \alpha p_3' + (1-\alpha)p_3''\). Thus, \(\alpha a + (1-\alpha)a'' \succ
\alpha a' + (1-\alpha)a''\).

(ii) Suppose \(a \succ a'\) because \(p_3 = p_3'\) and \(p_2 < p_2'\). For any value of \(\alpha \in (0,1]\)
we have:

    (ii.1) \(\alpha p_3 = \alpha p_3' \Rightarrow \alpha p_3 + (1-\alpha)p_3'' = \alpha p_3' + (1-\alpha)p_3''\).

    (ii.2) \(\alpha p_2 < \alpha p_2' \Rightarrow \alpha p_2 + (1-\alpha)p_2'' < \alpha p_2' + (1-\alpha)p_2''\).

From (ii.1) and (ii.2) we can conclude that \(\alpha a + (1-\alpha)a'' \succ \alpha a' +
(1-\alpha)a''\).

Therefore, from (i) and (ii) we can see that the binary relation satisfies Axiom NM2.

NM3 Consider a, a', and a'' in A and such that \(a \succ a' \succ a''\).

(i) Suppose that \(a \succ a' \succ a''\) because \(p_3 < p_3' < p_3''\).

Since \(p_3 < p_3' < p_3''\), \(\exists\, \alpha, \beta \in (0,1)\) such that \(\alpha p_3 + (1-\alpha)p_3'' < p_3' <
\beta p_3 + (1-\beta)p_3''\), since this is a property of real numbers. So, \(\alpha a + (1-\alpha)a'' \succ
a' \succ \beta a + (1-\beta)a''\).

(ii) Suppose that \(a \succ a' \succ a''\) because \(p_2 < p_2' < p_2''\) (and \(p_3 = p_3' = p_3''\)).

Since \(p_3 = p_3' = p_3''\), \(\forall\, \alpha, \beta \in (0,1)\): \(\alpha p_3 + (1-\alpha)p_3'' = p_3' = \beta p_3 + (1-\beta)p_3''\).
Also, \(p_2 < p_2' < p_2''\) implies that \(\exists\, \alpha, \beta \in (0,1)\) such that \(\alpha p_2 + (1-\alpha)p_2'' <
p_2' < \beta p_2 + (1-\beta)p_2''\). Thus, \(\alpha a + (1-\alpha)a'' \succ a' \succ \beta a + (1-\beta)a''\).

(iii) Suppose that \(a \succ a' \succ a''\) because of the following conditions:

    (a) \(p_3 < p_3'\),

    (b) \(p_3' = p_3''\) and \(p_2' < p_2''\).

We have \(\alpha p_3 + (1-\alpha)p_3'' = \alpha p_3 + (1-\alpha)p_3' < \alpha p_3' + (1-\alpha)p_3' = p_3'\).
This implies that \(\alpha p_3 + (1-\alpha)p_3'' < p_3'\) for every \(\alpha \in (0,1)\). Therefore, we cannot have \(a' \succ
\alpha a + (1-\alpha)a''\) for any such \(\alpha\).

By (iii) we can observe that Axiom NM3 is violated. □
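
Case (iii) can be illustrated with concrete numbers. The three lotteries below are hypothetical choices that satisfy conditions (a) and (b); the grid search is a numerical illustration, not a substitute for the argument above.

```python
from fractions import Fraction as F


def prefers(p, q):
    """Lexicographic rule of Problem 3.1: True if p is strictly preferred to q.

    Lotteries are triples (p1, p2, p3) over (win, draw, lose).
    """
    return p[2] < q[2] or (p[2] == q[2] and p[1] < q[1])


def mix(alpha, p, q):
    """Compound lottery alpha * p + (1 - alpha) * q."""
    return tuple(alpha * x + (1 - alpha) * y for x, y in zip(p, q))


# Hypothetical lotteries matching case (iii): p3 < p3' = p3'' and p2' < p2''.
a = (F(8, 10), F(1, 10), F(1, 10))
a1 = (F(4, 10), F(1, 10), F(5, 10))   # plays the role of a'
a2 = (F(2, 10), F(3, 10), F(5, 10))   # plays the role of a''
assert prefers(a, a1) and prefers(a1, a2)

# Look for a weight in (0, 1) at which a' is preferred to the mixture of a and a''.
weights = [F(k, 1000) for k in range(1, 1000)]
witnesses = [w for w in weights if prefers(a1, mix(w, a, a2))]
print(witnesses)  # []
```

However little weight is placed on a, the mixture has a strictly smaller probability of losing than a', so the tie-breaking dimension never gets a chance to matter.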

Problem 3.2 (Berger 1985) An investor has $1000 to invest in speculative stocks.
The investor is considering investing a dollars in stock A and (1000 - a) dollars in
stock B. An investment in stock A has a 0.6 chance of doubling in value, and a 0.4
chance of being lost. An investment in stock B has a 0.7 chance of doubling in value,
and a 0.3 chance of being lost. The investor’s utility function for a change in fortune,
z, is \(u(z) = \log(0.0007z + 1)\) for \(-1000 \le z \le 1000\).

(a) What is Z (for a fixed a )? (It consists of four elements.) 

(b) What is the optimal value of a in terms of expected utility? (Note: This 
perhaps indicates why most investors opt for a diversified portfolio of 
stocks.) 

Solution 


(a) \(Z = \{-1000,\ 2a - 1000,\ 1000 - 2a,\ 1000\}\).

(b) Based on Table 3.2 we can compute the expected utility, that is

\[U = 0.12 \log(0.3) + 0.18 \log(0.0014a + 0.3) + 0.28 \log(1.7 - 0.0014a) + 0.42 \log(1.7).\]

Now, let us look for the value of a which maximizes the expected utility (that
is, the optimum value a*). We have

\[\frac{dU}{da} = \frac{0.18}{0.0014a + 0.3}\,(0.0014) + \frac{0.28}{1.7 - 0.0014a}\,(-0.0014) = \frac{0.000252}{0.0014a + 0.3} - \frac{0.000392}{1.7 - 0.0014a}.\]

Table 3.2 Rewards, utilities, and probabilities for Problem 3.2, assuming
that the investment in stock A is independent of the investment of stock B.

z             -1000         2a - 1000             1000 - 2a             1000
Utility       log(0.3)      log(0.0014a + 0.3)    log(1.7 - 0.0014a)    log(1.7)
Probability   (0.4)(0.3)    (0.6)(0.3)            (0.4)(0.7)            (0.6)(0.7)


Setting the derivative equal to zero and evaluating it at a* leads to the
following equation:

\[\frac{0.000252}{0.0014a^* + 0.3} = \frac{0.000392}{1.7 - 0.0014a^*}.\]

By solving this equation we obtain a* = 344.72. Also, since the second
derivative

\[\frac{d^2U}{da^2} = -\frac{0.000252\,(0.0014)}{(0.0014a + 0.3)^2} - \frac{0.000392\,(0.0014)}{(1.7 - 0.0014a)^2} < 0,\]

we conclude that a* = 344.72 maximizes the expected utility U.

□
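
The solution to Problem 3.2 can be double-checked numerically. The sketch below recomputes a* from the first-order condition and compares it with a brute-force search over whole-dollar allocations; the function and variable names are ours.

```python
import math


def expected_utility(a):
    """Expected utility of putting a dollars in stock A (rows of Table 3.2)."""
    outcomes = [
        (-1000.0, 0.4 * 0.3),        # both investments lost
        (2 * a - 1000, 0.6 * 0.3),   # A doubles, B lost
        (1000 - 2 * a, 0.4 * 0.7),   # A lost, B doubles
        (1000.0, 0.6 * 0.7),         # both double
    ]
    return sum(prob * math.log(0.0007 * z + 1) for z, prob in outcomes)


# Closed form from the first-order condition
# 0.000252 (1.7 - 0.0014 a) = 0.000392 (0.0014 a + 0.3).
a_star = (0.000252 * 1.7 - 0.000392 * 0.3) / (0.0014 * (0.000252 + 0.000392))

# Independent check: best whole-dollar allocation on a grid.
a_grid = max(range(0, 1001), key=expected_utility)

print(round(a_star, 2), a_grid)
```

The grid maximizer lands within a dollar of the analytical optimum, as expected for a strictly concave objective.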

Problem 3.3 (French 1988) Consider the actions described in Table 3.1 in which 
the consequences are monetary payoffs. Convert this problem into one of choosing 
between lotteries, as defined in the von Neumann-Morgenstern theory. The decision 
maker holds the following indifferences with reference lotteries: 

£100 \(\sim\) £120 with probability 1/2; £90 with probability 1/2;

£110 \(\sim\) £120 with probability 4/5; £90 with probability 1/5.

Assume that \(\pi(\theta_1) = \pi(\theta_3) = 1/4\) and \(\pi(\theta_2) = 1/2\). Which is the optimal action
according to the expected utility principle?

Problem 3.4 Prove part (a) of Lemma 1. 

Problem 3.5 (Berger 1985) Assume that Mr. A and Mr. B have the same utility
function for a change, z, in their fortune, given by \(u(z) = z^{1/3}\). Suppose now that
one of the two men receives, as a gift, a lottery ticket which yields either a reward 
of r dollars (r > 0) or a reward of 0 dollars, with probability 1/2 each. Show that 
there exists a number b > 0 having the following property: regardless of which man 
receives the lottery ticket, he can sell it to the other man for b dollars and the sale 
will be advantageous to both men. 

Problem 3.6 Prove part (b) of Lemma 1. The proof is by contradiction. Define

\[\alpha^* = \sup\{\alpha \in [0,1] : a' \succeq \alpha a + (1-\alpha)a''\}\]

and consider separately the three cases:

\[\alpha^* a + (1-\alpha^*)a'' \succ a' \succeq a''\]
\[a \succeq a' \succ \alpha^* a + (1-\alpha^*)a''\]
\[a' \sim \alpha^* a + (1-\alpha^*)a''.\]

The proof is in Kreps (1988). Do not look it up; this was a big enough hint.

Problem 3.7 Prove the von Neumann-Morgenstern theorem in the \(\Leftarrow\) direction.

Problem 3.8 (From Kreps 1988) Kahneman and Tversky (1979) give the following 
example of a violation of the von Neumann-Morgenstern expected utility model. 
Ninety-five subjects were asked: 

Suppose you consider the possibility of insuring some property against 
damage, e.g., fire or theft. After examining the risks and the premium, 
you find that you have no clear preference between the options of 
purchasing insurance or leaving the property uninsured. 

It is then called to your attention that the insurance company offers a 
new program called probabilistic insurance. In this program you pay half 
of the regular premium. In case of damage, there is a 50 percent chance 
that you pay the other half of the premium and the insurance company 
covers all the losses; and there is a 50 percent chance that you get back 
your insurance payment and suffer all the losses... 

Recall that the premium is such that you find this insurance is barely 
worth its cost. 

Under these circumstances, would you purchase probabilistic
insurance?

And 80 percent of the subjects said that they would not. Ignore the time value of 
money. (Because the insurance company gets the premium now, or half now and half 
later, the interest that the premium might earn can be consequential. We want you 
to ignore such effects. To do this, you could assume that if the insurance company 
does insure you, the second half of the premium must be increased to account for the 
interest the company has forgone. While if it does not, when the company returns the 
first half premium, it must return it with the interest it has earned. But it is easiest 
simply to ignore these complications altogether.) The question is: does this provide a 
violation of the von Neumann-Morgenstern model, if we assume (as is typical) that 
all expected utility maximizers are risk neutral? 

Utility in action


The von Neumann and Morgenstern (NM) theory is an axiomatization of the
expected utility principle, but its logic also provides the basis for measuring an individual
decision maker’s utilities for outcomes. In this chapter, we discuss how NM
utility theory is typically applied to utility elicitation in practical decision problems.
In Section 4.1 we discuss a general utility elicitation approach, and then review basic
applications in economics and medicine.

In Section 4.2, we apply NM utility theory to situations where the set of rewards 
consists of alternative sums of money. An important question in this scenario is 
how attitudes towards future uncertainties change as current wealth changes. The 
key concept in this regard is risk aversion. We review some intuitive results which 
describe risk aversion mathematically and characterize properties of utility functions 
for money. Much of our discussion is based on Keeney et al. (1976) and Kreps 
(1988). 

In Section 4.3, we discuss how the lottery approach can be used to elicit patients’
preferences when the outcomes are related to future health. The main issue we will
consider is the trade-off between length and quality of life. We also introduce the concept
of a quality-adjusted life year (QALY), which is defined as the period of time
in good health that a patient says is equivalent to a year in ill health, and is commonly
used in medical decision-making applications. We discuss its relationship with utility
theory, and illustrate both with a simple example. A seminal application of this
methodology in medicine is McNeil et al. (1981).

Featured articles: 

Pratt, J. (1964). Risk aversion in the small and in the large, Econometrica 
32: 122-136. 

Torrance, G., Thomas, W. and Sackett, D. (1972). A utility maximization model for 
evaluation of health care programs, Health Services Research 7: 118-133. 

Useful general readings are Kreps (1988) for the economics; Pliskin et al. (1980)
and Chapman and Sonnenberg (2000) for the medicine.


4.1 The “standard gamble” 

When we presented the proof of the NM representation theorem we encountered one
step where the utility of each outcome was constructed by identifying the value u that
would make receiving that outcome for sure indifferent to a lottery giving the best
outcome with probability u and the worst with probability 1 - u. This step is also
the essence of a widely used approach to utility elicitation called “standard gamble.”

As a refresher, recall that in the NM theory Z is the set of rewards. The set
of actions (or lotteries, or gambles) is the set of probability distributions on Z. It is
called A and its typical elements are things like a, a', a''. In particular, \(\chi_z\) denotes the
degenerate lottery with mass 1 at reward z. We also define \(u : Z \to \mathbb{R}\) as the decision
maker’s utility function. Given a utility function u, the expected utility of lottery a is

\[U(a) = \sum_{z \in Z} u(z)\,p(z). \tag{4.1}\]

Preferences are described by the binary relation \(\succ\), satisfying the NM axioms, so that
\(a \succ a'\) if and only if

\[\sum_{z \in Z} u(z)\,p(z) > \sum_{z \in Z} u(z)\,p'(z).\]

When using the standard gamble approach to utility elicitation, the decision maker
lists all outcomes that can occur and ranks them in order of preference. If we avoid
the boring case in which all outcomes are valued equally by the decision maker, the
weak ordering assumption allows us to identify a worst outcome \(z_0\) and a best outcome
\(z^0\). For example, in assessing the utility of health states, “death” is often chosen
as the worst outcome and “full health” as the best, although in some problems there
are health outcomes that could be ranked worse than death (Torrance 1986). Worst
and best outcomes need not be unique. Because all utility functions that are positive
linear transformations of the same utility function lead to the same preferences,
we can arbitrarily set \(u(z_0) = 0\) and \(u(z^0) = 1\), which results in a convenient and
interpretable utility scale.

The decision maker’s utility for each intermediate outcome z can be inferred by
eliciting the value of u such that he or she is indifferent between two actions:

a  : outcome z for certain;
a' : outcome \(z_0\) with probability 1 - u, or
     outcome \(z^0\) with probability u.

Another way of writing action a' is \(u\chi_{z^0} + (1-u)\chi_{z_0}\). The existence of a value of u
reaching indifference is implied by the Archimedean and independence properties of
the decision maker’s preferences, as we have seen from Lemma 1. As you can check,
the expected utility of both actions is u, and therefore u(z) = u.
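
In practice the indifference point is usually approached through a sequence of pairwise choices. The sketch below simulates one such scheme, bisection on the winning probability, with a hypothetical decision maker whose hidden utility is a square root; the outcome range, the utility, and all function names are assumptions made for illustration.

```python
import math

z_worst, z_best = 0.0, 100.0   # hypothetical monetary outcomes


def hidden_utility(z):
    """Hypothetical utility of the simulated decision maker (unknown to the analyst)."""
    return math.sqrt(z / z_best)   # equals 0 at z_worst and 1 at z_best


def prefers_gamble(z, q):
    """Simulated answer: is the gamble (z_best w.p. q, z_worst w.p. 1 - q) preferred to z for sure?"""
    gamble = q * hidden_utility(z_best) + (1 - q) * hidden_utility(z_worst)
    return gamble > hidden_utility(z)


def elicit_utility(z, tol=1e-6):
    """Bisect on the winning probability until the indifference point is bracketed."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if prefers_gamble(z, q):
            hi = q    # gamble too attractive: lower the winning probability
        else:
            lo = q    # sure outcome at least as good: raise the winning probability
    return (lo + hi) / 2


for z in (25.0, 50.0, 75.0):
    print(z, round(elicit_utility(z), 4))
```

The elicited values recover the hidden utility because, on the 0 to 1 scale, the indifference probability is the utility of the sure outcome. A real elicitation would replace `prefers_gamble` with the decision maker's answers and would stop after far fewer questions.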

4.2 Utility of money 

4.2.1 Certainty equivalents 

After a long digression we are ready to go back to the notion, introduced by 
Bernoulli, that decisions about lotteries should be made by considering the moral 
expectation, that is the expectation of the value of money to the decision maker. We 
are now ready to explore in more detail how utility functions for money typically 
look. 

Say you are about to ship home a valuable rug you just bought in Samarkand for 
$9000. The probability that it will be lost during transport is 3% according to your 
intelligence in Uzbekistan. At which price would you be indifferent between buying 
the insurance or taking the risk? This price defines your certainty equivalent of the 
lottery defined by shipping without insurance. 

Formally, assume that the outcome space Z is an open interval, that is
\(Z = (z_0, z^0) \subset \Re\). Then, using the same notation as in Chapter 3, a certainty equivalent
is any reward \(z \in \Re\) that makes you indifferent between receiving that reward for
sure and choosing action a.

Definition 4.1 (Certainty equivalent) A certainty equivalent of lottery a is any
amount \(z^*\) such that

\[\chi_{z^*} \sim a, \tag{4.2}\]

or equivalently,

\[u(z^*) = \sum_{z} u(z)\,p(z). \tag{4.3}\]

A certainty equivalent is also referred to as “cash equivalent” and “selling price” (or 
“asking price”) of a lottery. 
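
As a rough illustration of equation (4.3), the certainty equivalent of the uninsured shipment can be computed once a utility function is fixed. Everything beyond the $9000 value and the 3% loss probability is an assumption made for this sketch: the total wealth of $50,000 and the logarithmic utility are hypothetical, and `certainty_equivalent` is our own helper name.

```python
import math


def certainty_equivalent(outcomes, probs, u, u_inv):
    """Return z* with u(z*) = sum_z u(z) p(z), as in equation (4.3)."""
    expected_utility = sum(p * u(z) for z, p in zip(outcomes, probs))
    return u_inv(expected_utility)


# Assumed for illustration only: total wealth of $50,000 (rug included)
# and logarithmic utility of final wealth.
wealth = 50_000.0


def u(z):
    return math.log(wealth + z)


def u_inv(v):
    return math.exp(v) - wealth


outcomes = [-9000.0, 0.0]   # rug lost, rug arrives
probs = [0.03, 0.97]

z_star = certainty_equivalent(outcomes, probs, u, u_inv)
expected_reward = sum(p * z for z, p in zip(outcomes, probs))
print(f"expected reward:      {expected_reward:.2f}")
print(f"certainty equivalent: {z_star:.2f}")
```

Under these assumptions the certainty equivalent is a larger loss than the expected loss, so this hypothetical decision maker would pay somewhat more than $270 for insurance.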

4.2.2 Risk aversion 

Meanwhile, in Samarkand, you calculate the expected monetary loss from the uninsured
shipment, that is $9000 times 0.03 or $270. Would you be willing to pay more
to buy insurance? If so, you qualify as a risk-averse individual.

If you define

\[\bar z = \sum_{z} z\,p(z) \tag{4.4}\]

as the expected reward under lottery a, then:

Definition 4.2 (Risk aversion) A decision maker is strictly risk averse if

\[\chi_{\bar z} \succ a. \tag{4.5}\]

Someone holding the reverse preference is called strictly risk seeking while someone
who is indifferent is called risk neutral.

It turns out that the definition above is equivalent to saying that the decision
maker is risk averse if \(\chi_{\bar z} \succ \chi_{z^*}\), that is, if he or she prefers the expected reward for
sure to receiving for sure a certainty equivalent.

Proposition 4.1 A decision maker is strictly risk averse if and only if his or her
utility function is strictly concave, that is

\[\chi_{\bar z} \succ a \iff u\Bigl(\sum_{z} z\,p(z)\Bigr) > \sum_{z} u(z)\,p(z). \tag{4.6}\]

Proof: Suppose first that the decision maker is risk averse and consider a lottery that
yields either \(z_1\) or \(z_2\) with probabilities p and \(1 - p\), respectively, where \(0 < p < 1\).
From Definition 4.2, \(\chi_{\bar z} \succ a\). The NM theory implies that

\[u(p z_1 + (1 - p) z_2) > p\,u(z_1) + (1 - p)\,u(z_2), \qquad 0 < p < 1, \tag{4.7}\]

which is the definition of strict concavity. Conversely, consider a lottery a over Z.
Since u is strictly concave, we know that

\[u\Bigl(\sum_{z} z\,p(z)\Bigr) > \sum_{z} u(z)\,p(z) \tag{4.8}\]

and it follows once again from (4.5) that the decision maker is risk averse. □


Analogously, strict risk seeking and risk neutrality can be defined in terms of 
convexity and linearity of u, respectively, as illustrated in Figure 4.1. 

Figure 4.1 Utility functions corresponding to different risk behaviors.

Certainty equivalents exist and are unique under relatively mild conditions:

Proposition 4.2 (a) If u is strictly increasing, then \(z^*\) is unique.

(b) If u is continuous, then there is at least one certainty equivalent.

(c) If u is concave, then there is at least one certainty equivalent.

For the remainder of this chapter, we will assume that u is a concave strictly
increasing utility function, so that the certainty equivalent is

\[z^*(a) = u^{-1}\Bigl(\sum_{z} u(z)\,p(z)\Bigr). \tag{4.9}\]


Back in Samarkand you have done some thinking and concluded that you would 
be willing to pay up to $350 to insure the rug. The extra $80 amount you are willing to 
pay, on top of the expected value of $270, is called the insurance premium. If insurers 
did not ask for this extra amount they would need a day job. The risk premium is the 
flip side of the insurance premium: 

Definition 4.3 (Risk premium) The risk premium associated with a lottery a is the
difference between the lottery’s expected value and its certainty equivalent:

\[RP(a) = \bar z - z^*(a). \tag{4.10}\]

Figure 4.2 shows this. The negative of the risk premium, or \(-RP(a)\), is the insurance
premium. This works out in our example too. You have bought the rug already, so we
are talking about negative sums of money at this point.
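
As a worked check of Definition 4.3 on the rug example, measure rewards as changes in wealth, so that losses are negative, and take the $350 figure as the payment that leaves you exactly indifferent (an assumption, since the text only says "up to"):

```latex
\bar z = 0.03 \times (-9000) + 0.97 \times 0 = -270, \qquad
z^*(a) = -350, \qquad
RP(a) = \bar z - z^*(a) = -270 - (-350) = 80 .
```

Under this reading the extra 80 dollars paid on top of the expected loss enters your wealth as \(-RP(a) = -80\).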


Figure 4.2 Risk premium. (Plot of utility against z; the axis labels include \(u(\bar z)\) and \(u(z^*)\).)

4.2.3 A measure of risk aversion

We discuss next how the decision maker’s behavior changes when his or her initial
wealth changes proportionally. Let \(\omega\) be the decision maker’s wealth prior to the
gamble and let \(a + \omega\) be a shorthand notation to represent the final wealth position.
Then the decision maker seeks to maximize

\[U(a + \omega) = \sum_{z} u(z + \omega)\,p(z). \tag{4.11}\]

For example, suppose that the choices are: (i) the gamble represented by a, or (ii) a
sure payment in the amount z. Then, we can represent his or her final wealth as \(a + \omega\)
or \(z + \omega\). The choice will depend on whether \(U(a + \omega)\) is greater than, equal to, or
lower than \(u(z + \omega)\).

Definition 4.4 (Decreasing risk aversion) A utility function u is said to be
decreasingly absolute risk averse if for all \(a \in A\), \(z \in \Re\), and \(\omega, \omega' \in \Re\) such that
\(a + \omega\), \(a + \omega'\), \(z + \omega\), and \(z + \omega'\) all lie in Z and \(\omega' > \omega\),

\[\text{if } U(a + \omega) > u(z + \omega), \text{ then } U(a + \omega') > u(z + \omega'). \tag{4.12}\]

The idea behind this definition is that, for instance, if you are willing to take the 
lottery over the sure amount when your wealth is 10 000 dollars, you would also take 
the lottery if you had a larger wealth. The richer you are, the less risk averse you 
become. 

Another way to define decreasing risk aversion is to check whether the risk 
premium goes down with wealth: 

Proposition 4.3 A utility function u is decreasingly risk averse if and only if for all
a in A, the function \(\omega \mapsto RP(a + \omega)\) is nonincreasing in \(\omega\).

A third way to think about decreasing risk aversion is to look at the derivatives 
of the utility function: 

Theorem 4.1 A utility function u is decreasingly risk averse if and only if the
function

\[\lambda(z) = -\frac{u''(z)}{u'(z)} = -\frac{d}{dz}\bigl(\log u'(z)\bigr) \tag{4.13}\]

is nonincreasing in z.
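
The criterion of Theorem 4.1 is easy to probe numerically. The sketch below estimates \(-u''(z)/u'(z)\) by finite differences for two utilities chosen purely for illustration (a logarithmic one and an exponential one with rate 0.5); `risk_aversion` is a hypothetical helper, not notation from the text.

```python
import math


def risk_aversion(u, z, h=1e-3):
    """Central-difference estimate of lambda(z) = -u''(z) / u'(z)."""
    first = (u(z + h) - u(z - h)) / (2 * h)
    second = (u(z + h) - 2 * u(z) + u(z - h)) / h**2
    return -second / first


def log_utility(z):
    return math.log(z)            # lambda(z) = 1/z, decreasing in z


def exp_utility(z):
    return -math.exp(-0.5 * z)    # lambda(z) = 0.5, constant in z


for z in (1.0, 2.0, 4.0):
    print(z,
          round(risk_aversion(log_utility, z), 3),
          round(risk_aversion(exp_utility, z), 3))
```

The logarithmic utility shows a risk aversion that falls as z grows, which is the decreasing case; the exponential utility stays flat.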

While we do not provide a complete proof (see Pratt 1964), the following two results
do a large share of the work:

Proposition 4.4 If \(u_1\) and \(u_2\) are such that \(\lambda_1(z) \ge \lambda_2(z)\) for all z in Z, then
\(u_1(z) = f(u_2(z))\) for some concave, strictly increasing function f from the range of \(u_2\) to \(\Re\).

Proposition 4.5 A decision maker, with utility function \(u_1\), is at least as risk
averse as another decision maker, with utility function \(u_2\), if and only if, for all z in
Z, \(\lambda_1(z) \ge \lambda_2(z)\).

The function \(\lambda(z)\), defined in Theorem 4.1, is known as the Arrow-Pratt measure
of risk aversion, or local risk aversion at z. Pratt writes:

we may interpret \(\lambda(z)\) as a measure of the local risk aversion or local
propensity to insure at the point z under the utility function u; \(-\lambda(z)\)
would measure locally liking for risk or propensity to gamble. Notice
that we have not introduced any measure of risk aversion in the large.
Aversion to ordinary (as opposed to infinitesimal) risks might be considered
measured by RP(a), but RP is a much more complicated function
than \(\lambda\). Despite the absence of any simple measure of risk aversion in the
large, we shall see that comparisons of aversion to risk can be made simply
in the large as well as in the small. (Pratt 1964, p. 126 with notational
changes)

Example 4.1 Let \(\omega\) be the decision maker’s wealth prior to a gamble a with small
range for the rewards and such that the expected value is 0. Define the moments of
the rewards as

\[E[a^m] = \sum_{z} z^m p(z),\]

so that \(E[a] = 0\). The decision maker’s risk premium for \(a + \omega\) is

\[RP(a + \omega) = E[a + \omega] - z^*(a + \omega),\]

which implies

\[z^*(a + \omega) = E[a + \omega] - RP(a + \omega) = \omega - RP(a + \omega).\]

By the definition of certainty equivalent,

\[E[u(a + \omega)] = u(z^*(a + \omega)) = u(\omega - RP(a + \omega)).\]

Let us now expand both terms of this equality by using Taylor’s formula (in order to
simplify the notation let \(RP(a + \omega) = RP\)). We obtain

\[u(\omega - RP) = u(\omega) - RP\,u'(\omega) + \frac{RP^2}{2!}\,u''(\omega) + \cdots \tag{4.14}\]

\[E[u(\omega + a)] = E\Bigl[u(\omega) + a\,u'(\omega) + \frac{1}{2!}a^2 u''(\omega) + \frac{1}{3!}a^3 u'''(\omega) + \cdots\Bigr]
= u(\omega) + \frac{1}{2}E[a^2]\,u''(\omega) + \frac{1}{3!}E[a^3]\,u'''(\omega) + \cdots \tag{4.15}\]

By setting equation (4.14) equal to equation (4.15) while neglecting higher-order
terms we obtain

\[-RP\,u'(\omega) \approx \frac{1}{2}E[a^2]\,u''(\omega). \tag{4.16}\]

Since \(E[a^2] = \mathrm{Var}[a]\), and \(\lambda(\omega) = -u''(\omega)/u'(\omega)\),

\[\lambda(\omega) \approx 2RP/\mathrm{Var}[a], \tag{4.17}\]

that is, the decision maker’s risk aversion \(\lambda(\omega)\) is twice the risk premium per unit of
variance for small risks. ★
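
A quick numerical check of approximation (4.17) is sketched below. The logarithmic utility, the initial wealth of 10, and the 50/50 gamble of plus or minus 0.1 are all assumptions chosen for illustration; `risk_premium` is a hypothetical helper.

```python
import math


def risk_premium(outcomes, probs, wealth, u, u_inv):
    """RP(a + w) = E[a + w] - z*(a + w) for a discrete gamble a."""
    mean_wealth = wealth + sum(p * z for z, p in zip(outcomes, probs))
    expected_utility = sum(p * u(wealth + z) for z, p in zip(outcomes, probs))
    return mean_wealth - u_inv(expected_utility)


wealth = 10.0                                # hypothetical initial wealth
outcomes, probs = [-0.1, 0.1], [0.5, 0.5]    # small zero-mean gamble
variance = sum(p * z**2 for z, p in zip(outcomes, probs))

rp = risk_premium(outcomes, probs, wealth, math.log, math.exp)
approx_lambda = 2 * rp / variance   # right-hand side of (4.17)
exact_lambda = 1 / wealth           # -u''/u' for u = log
print(approx_lambda, exact_lambda)
```

The two numbers agree closely here because the gamble is small relative to wealth; widening the gamble makes the neglected higher-order terms visible.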


Corollary 1 A utility function u is constantly risk averse if and only if \(\lambda(z)\) is
constant, in which case there exist constants \(a > 0\) and b such that

\[u(z) = \begin{cases} az + b & \text{if } \lambda(z) = 0, \text{ or} \\ -a e^{-\lambda z} + b & \text{if } \lambda(z) = \lambda > 0. \end{cases} \tag{4.18}\]


If the amount of money involved is small compared to the decision maker’s initial 
wealth, a constant risk aversion is often considered an acceptable rule. 
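
One practical consequence of constant risk aversion is that the risk premium of a given gamble does not depend on initial wealth. The sketch below illustrates this for the exponential form in (4.18) with a = 1 and b = 0; the rate 0.002 and the 50/50 gamble of plus or minus 100 are hypothetical values, and the helper names are ours.

```python
import math

LAM = 0.002   # hypothetical constant risk aversion


def u(z):
    return -math.exp(-LAM * z)


def u_inv(v):
    return -math.log(-v) / LAM


def risk_premium(outcomes, probs, wealth):
    """Expected final wealth minus its certainty equivalent."""
    mean = sum(p * (wealth + z) for z, p in zip(outcomes, probs))
    expected_utility = sum(p * u(wealth + z) for z, p in zip(outcomes, probs))
    return mean - u_inv(expected_utility)


gamble, probs = [-100.0, 100.0], [0.5, 0.5]
premiums = [risk_premium(gamble, probs, w) for w in (0.0, 1_000.0, 5_000.0)]
for w, rp in zip((0.0, 1_000.0, 5_000.0), premiums):
    print(w, round(rp, 4))
```

The printed premiums coincide across wealth levels up to floating-point error, in contrast with the logarithmic case where the premium shrinks as wealth grows.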

The risk-aversion function captures the information on preferences in the
following sense:

Theorem 4.2 Two utility functions \(u_1\) and \(u_2\) have the same preference ranking for
any two lotteries if and only if they have the same risk-aversion function.

Proof: If \(u_1\) and \(u_2\) have the same preference ranking for any two lotteries, they
are affine transformations of one another. That is, \(u_1(z) = a + b\,u_2(z)\) with \(b > 0\). Therefore,
\(u_1'(z) = b\,u_2'(z)\) and \(u_1''(z) = b\,u_2''(z)\), so

\[\lambda_1(z) = -\frac{u_1''(z)}{u_1'(z)} = -\frac{b\,u_2''(z)}{b\,u_2'(z)} = \lambda_2(z). \tag{4.19}\]

Conversely, \(\lambda(z) = -(d/dz)(\log u'(z))\) and then

\[-\int \lambda(z)\,dz = \log u'(z) + c\]

and \(\exp(-\int \lambda(z)\,dz) = e^c u'(z)\), which implies

\[\int \exp\Bigl(-\int \lambda(z)\,dz\Bigr)dz = \int e^c u'(z)\,dz = e^c u(z) + d. \tag{4.20}\]

Since \(e^c\) and d are constants, \(\lambda(z)\) specifies \(u(z)\) up to positive linear
transformations. □

With regard to this result, Pratt comments that: 

the local risk aversion \(\lambda\) associated with any utility function u contains
all essential information about u while eliminating everything arbitrary
about it. However, decisions about ordinary (as opposed to “small”) risks
are determined by \(\lambda\) only through u ... so it is not convenient entirely to
eliminate u from consideration in favor of \(\lambda\). (Pratt 1964, p. 126 with
notational changes)


4.3 Utility functions for medical decisions 

4.3.1 Length and quality of life 

Decision theory is used in medicine in two scenarios: one is the evaluation of policy
decisions that affect groups of individuals, typically carried out by cost-effectiveness,
cost-utility or similar analyses. The other is decision making for individuals facing
complicated choices regarding their health. Though the two scenarios are very
different from a decision-theoretic standpoint, in both cases the foundation is a
measurement of utility for future health outcomes. The reason why utility plays a key
role is that simpler measures of outcome, like duration of life (or in medical jargon
survival), fail to capture critical trade-offs. In this regard, McNeil et al. (1981)
observe:

The importance of integrating attitudes toward the length of survival and
the quality of life is clear. First of all, it is known that persons have different
preferences for length of survival: some place greater value on
proximate years than on distant years. ... Secondly, the burden of different
types of illnesses may vary from patient to patient and from one
disorder to another. ... Thirdly, although some people would be willing
to trade off some fraction of their lives to avoid morbidities ... if they
had normal life expectancy, they might be willing to trade off smaller
fractions of their lives if they had a shorter life expectancy. Thus, the
importance of estimating the value of different states of health, assuming
different potential lengths of survival, is critical. (McNeil et al. 1981,
p. 987)

Utility elicitation in medical decision making is complex, and the literature is
extensive (Naglie et al. 1997, Chapman & Sonnenberg 2000). Here we first illustrate
the application of the NM theory, then present an alternative approach based on time
trade-offs rather than probability trade-offs, and lastly discuss conditions for the two
to be equivalent.

4.3.2 Standard gamble for health states 

In Section 4.1 we described a general procedure for utility elicitation. We discuss 
next a mathematically equivalent example of utility elicitation using the standard 
gamble approach in a medical example. 

Example 4.2 A person with severe chronic pain has the option to have surgery that
could remove the pain completely, with probability 80%, although there is a 4% risk
of death from the surgery. In the remaining 16% of cases, the surgery has no effect. In
this example, the worst outcome is death with utility 0, and the best is full recovery
with no pain, with utility 1. For the intermediate outcome chronic pain, the standard
gamble is shown in Figure 4.3. Suppose that the patient’s indifference probability
\(\alpha\) is 0.85. This implies that the utility for chronic pain is 0.85. Thus, the expected
utility for surgery is \(0.04 \times 0 + 0.16 \times 0.85 + 0.8 \times 1 = 0.936\), which is larger than
the expected utility of no surgery, that is 0.85. ★
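
The comparison in Example 4.2 can be written out as a short calculation. The variable names are ours; the probabilities and utilities are those of the example.

```python
# Probabilities of the surgery outcomes from Example 4.2.
p_death, p_no_effect, p_cured = 0.04, 0.16, 0.80

# Utilities on the 0-1 scale; 0.85 is the elicited indifference probability.
u_death, u_pain, u_cured = 0.0, 0.85, 1.0

eu_surgery = p_death * u_death + p_no_effect * u_pain + p_cured * u_cured
eu_no_surgery = u_pain

print(round(eu_surgery, 3), round(eu_no_surgery, 3))
print("surgery" if eu_surgery > eu_no_surgery else "no surgery")
```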


4.3.3 The time trade-off methods 

In the standard gamble, the trade-off is between an intermediate option for sure and
two extreme options with given probabilities. The time trade-off method (Torrance
1971, Torrance et al. 1972) plays a similar game with time instead of probability.
It is typically used to assess a patient’s attitude towards the number of years in ill
health he or she is willing to give up in order to live in good health for a shorter
number of years. The time in good health equivalent to a year of ill health is called
the quality-adjusted life year (QALY). To implement this method we first pick an
arbitrary period of time t in a particular condition, say chronic pain. Then, we find
the amount of time in good health the patient considers equivalent to the arbitrary
period of time in the condition. For applications in the analysis of clinical trials
results, see Glasziou et al. (1990).

Figure 4.3 Standard gamble for assessing utility for chronic pain. (Diagram: a sure
thing leading to chronic pain, compared with a gamble.)

Figure 4.4 Time trade-off for assessing surgical treatment of chronic pain. Figure
based on Torrance et al. (1972). (Diagram: health state, with levels labeled health,
chronic pain, and death, plotted against time, with points 0, x, and t marked and
Options 1 and 2 indicated.)

Example 4.3 (Continuation of the example of Section 4.3.2) Consider a time
horizon of t = 20 years, in the ballpark of the patient’s life expectancy. This is not a
choice to be taken lightly, but for now we will pretend it is easy. The time trade-off
method offers two options to the patient, shown in Figure 4.4. Option 1 is t = 20
years in chronic pain. Option 2, also deterministic, is x < t years in full health.
Time x is varied until the patient is indifferent between the two options. The quality
adjustment factor is q = x/t. If our patient is willing to give up 5 years if he or she
could be certain to live without any pain for 15 years, then q = 15/20 = 0.75. The
expected quality-adjusted life expectancy without surgery is \(0.75 \times 20 = 15\) years, while with
surgery it is \(0.04 \times 0 + 0.16 \times 15 + 0.8 \times 20 = 18.4\), so surgery is preferred. ★
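
The quality-adjusted comparison can be sketched in the same way. With q = 0.75, twenty years in chronic pain count as 0.75 × 20 = 15 quality-adjusted years, which is the value that appears in the surgery calculation. Variable names are ours.

```python
t = 20.0          # time horizon in years
x = 15.0          # years in full health judged equivalent to t years in pain
q = x / t         # quality adjustment factor

p_death, p_no_effect, p_cured = 0.04, 0.16, 0.80

# Quality-adjusted years: chronic pain for t years counts as q * t.
qaly_no_surgery = q * t
qaly_surgery = p_death * 0.0 + p_no_effect * (q * t) + p_cured * t

print(q, qaly_no_surgery, round(qaly_surgery, 2))
```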


4.3.4 Relation between QALYs and utilities 

Weinstein et al. (1980) identified conditions under which the time trade-off method
and the standard gamble method are equivalent. The first condition is that the utility
of health states must be such that there is independence between length of life and
quality of life. In other words, the trade-offs established on one dimension do not
depend on the levels of the other dimension. The second condition is that the utility
of health states must be such that trade-offs are proportional, in the following sense:
if a person is indifferent between x years of life in good health and x' years of life
in chronic pain, then he or she must also be indifferent between spending qx years
of life in excellent health and qx' years in chronic pain. The third condition is that
the individual must be risk neutral with respect to years of life. This means that the
patient’s utility for living the next year is the same as for living each subsequent year
(that is, his or her marginal utility is constant). The last assumption is not generally
considered very realistic. Moreover, the time trade-off and standard gamble
could lead to empirically different results even when the decision maker is close to
risk neutrality, depending on how different aspects of utilities, such as desirability of
states, and time preferences are empirically captured by these methods.

Figure 4.5 Relationship between the standard gamble and time trade-off methods.
(Diagram labels: time trade-off, with qt years and t years; standard gamble, with a
gamble between t years in good health and death; health states good health, chronic
pain, and death.)

If all these conditions hold, we can establish a correspondence between the quality
adjustment factor q elicited using the time trade-off method and the utility \(\alpha\). We
wish to establish that, for a given health outcome z, held constant over a period of t
years, \(\alpha t = qt\) no matter what t is. Figure 4.5 outlines the logic. Using the first
condition we can establish trade-offs for length of life and quality of life independently.
So, let us first consider the trade-offs for quality of life. Using the standard gamble
approach, we find \(\alpha\) such that a lottery with probability \(\alpha\) for 1 unit of time in \(z^0\) and
\((1 - \alpha)\) in \(z_0\) is equivalent to a lottery which yields an intermediate outcome z for 1
unit of time, for sure. Thus, the utility of outcome z for 1 unit of time is \(\alpha\). Using
the risk-neutrality condition, the expected utility of t units of time in z is \(t \times \alpha\). Next,
let us consider the time trade-off method. From the proportional trade-offs condition,
if a person is indifferent between 1 unit of time in outcome z and q units of time in \(z^0\),
he or she is also indifferent between t units of time in outcome z and qt units of time
in \(z^0\). So if these three conditions are true, both methods lead to the same evaluation
for living t years in health outcome z.


4.3.5 Utilities for time in ill health 

The logic outlined in the previous section can also be used to elicit a patient’s utilities
for a period of time in ill health in a two-step procedure. In the first step the standard
gamble strategy is used to elicit the utility of a certain number of years in good health.
In the second step the time trade-off is used to elicit the patient’s preferences between
living a shorter time in good health and a longer time in ill health. Alternatively,
one could assess the patient’s utility considering length and quality of life together via
lotteries such as those in the NM theory. However, it is usually more difficult to think
about length and quality of life at the same time.

Example 4.4 We will illustrate this using an example based on a patient currently
suffering from some chronic condition; that is, whose current health state z is stable
over time and less desirable than full health. The numbers in the example are
from Sox et al. (1988). Using the standard gamble approach, we can elicit patient
utilities for different lengths of years in perfect health as shown by Table 4.1. This
table requires three separate applications of the standard gamble: one for each of
the three intermediate lengths of life. Panel (a) of Figure 4.6 shows an interpolation
of the resulting mapping between life length and utility, represented by the solid
line. The dashed diagonal line represents a person with constant marginal utility. The
patient considers a 50/50 gamble between 0 (immediate death) and 25 years (full life
expectancy) equivalent to a sure survival of 7 years (represented by a triangle in the
figure). Thus, the patient would be willing to “trade off” the difference between the
average life expectancy and the certainty equivalent, that is 12.5 − 7 = 5.5 years to
avoid the risk. This is a complicated choice to think about in the abstract, but it is not
unrealistic for patients facing surgeries that involve a serious risk of death.

Table 4.1 Patient’s utility function for the length of a healthy life.

Years of perfect health    Utility
 0                         0.00
 3                         0.25
 7                         0.50
12.5                       0.75
25                         1.00

Figure 4.6 Trade-off curves for the elicitation of utilities for time in ill health.
(Panels with axes labeled utility and years of healthy life.)

Table 4.2 Equivalent years of life and patient’s utility function for
various lengths of life in ill health.

Years of disabled life    Equivalent years of perfect health    Utility
 5                         5                                    0.38
10                         9                                    0.59
15                        13                                    0.76
20                        17                                    0.84
25                        21                                    0.92

Next, we elicit how long a period of life in full health the patient considered to 
be equivalent to life in his or her current state. Responses are reported in the first 
two columns of Table 4.2 and shown in Figure 4.6(b). The solid line represents the 
relation for a person for whom full health and current health are valued the same. The 
dashed line represents the relation for a person for whom the current health condition 
represents an important loss of quality of life. The patient would “trade off” some 
years in his or her current health for a better health state. 

Finally, using both tables (and parts (a) and (b) of Figure 4.6) we can derive
the patient’s utility for different durations of years of life with the current chronic
condition. This is listed in the last column of Table 4.2 and shown in Figure 4.6(c),
where the solid line is the utility for being in full health throughout one’s life, while
the dashed line is the utility for the same duration with the current chronic condition.
The derivation is straightforward. Consider the second row of Table 4.2: 10 years in
the current health state are equivalent to 9 years in perfect health. Using Table 4.1,
we find the utility using a linear interpolation. Let \(u_x\) denote the utility for x years of
healthy life, which gives

\[u_9 = 0.50 + 2 \times \frac{0.75 - 0.50}{5.5} = 0.59.\]
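
The same interpolation can be sketched for every row of Table 4.2. This is an illustrative sketch only: the function name is ours, and the data are the entries of Tables 4.1 and 4.2.

```python
# Table 4.1: utility of years in perfect health (standard gamble).
years = [0.0, 3.0, 7.0, 12.5, 25.0]
utility = [0.00, 0.25, 0.50, 0.75, 1.00]


def utility_of_healthy_years(x):
    """Piecewise-linear interpolation of Table 4.1."""
    points = list(zip(years, utility))
    for (x0, u0), (x1, u1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return u0 + (x - x0) * (u1 - u0) / (x1 - x0)
    raise ValueError("x lies outside the range of Table 4.1")


# Table 4.2: years in the current state -> equivalent years of perfect health.
equivalent_years = {5: 5, 10: 9, 15: 13, 20: 17, 25: 21}

for disabled, healthy in equivalent_years.items():
    print(disabled, healthy, f"{utility_of_healthy_years(healthy):.3f}")
```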


Similar calculations give the remaining utility values shown in Table 4.2 and
Figure 4.6(c). ★

4.3.6 Difficulties in assessing utility

The elicitation of reliable utilities is one of the major obstacles in the application
of decision analysis to medical problems. Research on descriptive decision making
compares preferences revealed by empirical observations of decisions to utilities
ascertained by elicitation methods. The conclusion is generally that the assessed
utility function may not reflect the patient’s true preferences. The difficulties arise from
the fact that in many circumstances, important components of the decision problem
may not be reliably quantified. Factors such as attitudes in coping with disease
and death, ethical or religious beliefs, etc., can lead to preferences that are more
complex than what we can describe within the axiomatization of expected utility
theory.

Even if that is not the case, cognitive aspects may challenge utility assessment.
Three categories of difficulties are often cited: framing effects, problems with small
probabilities, and changes in utilities over time. Framing effects are artifacts and
inconsistencies that arise from how questions are formulated during elicitation, or from the
description of the scenarios for the outcomes. For example, “the framing of benefit
(or risk) in relative versus absolute terms may have a major influence on patient
preferences” (Malenka et al. 1993).

Very small or very large probabilities are a challenge even for decision makers
who may have an intuitive understanding of the concept of probability. Yet these
are pervasive in medicine: for example, all major surgeries involve a chance of seriously
adverse outcomes; all vaccination decisions are based on trading off short-term
symptoms for sure against a serious illness with very small probability. In such cases,
a sensitivity analysis can be used to evaluate how changes in the utilities would
affect the ultimate decision making. A related alternative is inverse decision theory
(Swartz et al. 2006, Davies et al. 2007). This is applicable to cases where a small
number of actions are available, and is based on partitioning the utility and probability
inputs into sets, each collecting all the inputs that lead to the same decision.
For example, in Section 4.3.2 we would determine values of u for which surgery
is the best strategy. The expected utility of no surgery is u, while for surgery it
is \(0.04 \times 0 + 0.16 \times u + 0.8 \times 1\). Thus, the optimal strategy is surgery whenever
\(u < 0.8/0.84 \approx 0.95\). A more general version would partition the probabilities as well.
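
The partition of utility values for the surgery example can be sketched as follows. The function name `best_action` is hypothetical; the probabilities are those of Section 4.3.2.

```python
p_death, p_no_effect, p_cured = 0.04, 0.16, 0.80


def best_action(u_pain):
    """Compare expected utilities for a given utility of chronic pain."""
    eu_surgery = p_death * 0.0 + p_no_effect * u_pain + p_cured * 1.0
    return "surgery" if eu_surgery > u_pain else "no surgery"


# Surgery wins when 0.16 u + 0.8 > u, that is when u < 0.8 / 0.84.
threshold = p_cured / (1.0 - p_no_effect)
print(round(threshold, 4))
print(best_action(0.85), "|", best_action(0.97))
```

Utilities for chronic pain below the threshold fall in the set of inputs leading to surgery; those above it lead to no surgery.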

The patient’s attitudes towards length of life may change with age, or life
expectancy. Should a patient anticipate these changes and decide now based on
the prediction of what his or her utilities are expected to become? This sounds rational
but prohibitive to implement. The default approach is that the patient’s current
preferences should be the basis for making decisions.

Lastly, utilities vary widely across individuals, partly because their preferences 
do and partly as a result of challenges in measurement. For certain health states, such 
as the consequences of a major stroke, patients distribute across the whole range— 
with some patients ranking the health state worse than death, and others considering 
it close to full health (Samsa et al. 1988). This makes it difficult to use replication 
across individuals to increase the precision of estimates. 

For an extended review of these issues see also Chapman & Elstein (2000).

4.4 Exercises

Problem 4.1 (Kreps 1988, problem 7) Suppose a decision maker has constant
absolute risk aversion over the range −$100 to $1000. We ask her for her certainty
equivalent for a gamble with prizes $0 and $1000, each with probability one-half, and
she says that her certainty equivalent for this gamble is $488. What, then, should she
choose, if faced with the choice of:

a: a gamble with prizes −$100, $300, and $1000, each with probability 1/3;
a': a gamble with prize $530 with probability 3/4 and $0 with probability 1/4; or
a'': a gamble with a sure thing payment of $385?

Solution 

Since the decision maker has constant absolute risk aversion over the range −$100
to $1000, we have

\[u(z) = -a e^{-\lambda z} + b, \quad \text{for all } z \text{ in } [-100, 1000]. \tag{4.21}\]

We know that the certainty equivalent for a 50/50 gamble with prizes $0 and $1000
is $488. Therefore,

\[u(488) = \frac{1}{2}u(0) + \frac{1}{2}u(1000). \tag{4.22}\]

Suppose \(u(0) = 0\), \(u(1000) = 1\), and consider equations (4.21) and (4.22). We
have

\[0 = -a e^{-\lambda \cdot 0} + b\]
\[1 = -a e^{-1000\lambda} + b\]
\[1/2 = -a e^{-488\lambda} + b.\]

This system implies that \(a = b = 10.9207\) and \(\lambda = 0.0000960369\). Therefore, we
have

\[u(z) = 10.9207\,(1 - e^{-0.0000960369 z}), \quad \text{for all } z \text{ in } [-100, 1000]. \tag{4.23}\]

From equation (4.23) we obtain

\[u(-100) = -0.105384\]
\[u(300) = 0.310148\]
\[u(385) = 0.396411\]
\[u(530) = 0.541949.\]

The expected utilities associated with gambles a, a', and a'' are

\[U(a) = \tfrac{1}{3}\bigl(u(-100) + u(300) + u(1000)\bigr) = 0.401588\]
\[U(a') = \tfrac{3}{4}u(530) + \tfrac{1}{4}u(0) = 0.406462\]
\[U(a'') = u(385) = 0.396411.\]

Based on these values we conclude that the decision maker should choose gamble a'
since it has the maximum expected utility. □
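
The solution can be reproduced numerically. The sketch below solves the system for the risk-aversion parameter by bisection and then ranks the three options. The bracketing interval and the helper names are assumptions of this sketch, not part of the original solution.

```python
import math


def gap(lam):
    """u(488) - 1/2 on the scale u(0) = 0, u(1000) = 1."""
    return math.expm1(-488 * lam) / math.expm1(-1000 * lam) - 0.5


# Bisection for the risk-aversion parameter; the bracket is an assumption.
lo, hi = 1e-8, 1e-2
for _ in range(200):
    mid = (lo + hi) / 2
    if gap(mid) < 0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
a = -1 / math.expm1(-1000 * lam)   # equals 1 / (1 - exp(-1000 * lam))


def u(z):
    # a * (1 - exp(-lam * z)), as in equation (4.23)
    return -a * math.expm1(-lam * z)


expected = {
    "a": (u(-100) + u(300) + u(1000)) / 3,
    "a'": 0.75 * u(530) + 0.25 * u(0),
    "a''": u(385),
}
print(lam, a)
print(max(expected, key=expected.get))
```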

Problem 4.2 Find the following: 

(a) A practical decision problem where it is reasonable to assume that there are
only four relevant outcomes \(z_1, \ldots, z_4\).

(b) A friend willing to waste 15 minutes.

(c) Your friend’s utilities \(u(z_1), \ldots, u(z_4)\).

Here’s the catch. You can only ask your friend questions about preferences for 
von Neumann-Morgenstern lotteries. You can assume that your friend believes that 
Axioms NM1, NM2, and NM3 are reasonable. 

Problem 4.3 Prove part (b) of Proposition 4.2. 

Problem 4.4 Was Bernoulli risk averse? Decreasingly risk averse? 

Problem 4.5 (Lindley 1985) A doctor has the task of deciding whether or not to 
carry out a dangerous operation on a person suspected of suffering from a disease. If 
he has the disease and does operate, the chance of recovery is only 50%; without an 
operation the similar chance is only 1 in 20. On the other hand if he does not have 
the disease and the operation is performed there is 1 chance in 5 of his dying as a 
result of the operation, whereas there is no chance of death without the operation. 
Advise the doctor (you may assume there are always only two possibilities, death or 
recovery). 

Problem 4.6 You are eliciting someone’s utility for money. You know this person
has constant risk aversion in the range $0 to $1000. You propose gambles of the form
$0 with probability p and $1000 with probability 1 − p for the following four values
of p: 1/10, 1/3, 2/3, and 9/10. You get the following certainty equivalents: 0.25,
0.60, 0.85, and 0.93. Verify that these are not consistent with constant risk aversion.
Assuming that the discrepancies are due to difficulty in the exact elicitation of certainty
equivalents, rounding, etc., find a utility function with constant risk aversion
that closely approximates the elicited certainty equivalents. Justify briefly the method
you use for choosing the approximation.


Problem 4.7 (From French 1988) An investor has $1000 to invest in two types
of shares. If he invests $a in share A, he will invest $(1000 − a) in share B. An
investment in share A has a 0.7 chance of doubling value and a 0.3 chance of being
lost altogether. An investment in share B has a 0.6 chance of doubling value and
a 0.4 chance of being lost altogether. Outcomes of the two investments are statistically
independent. Determine the optimal value of a if the utility function is
\(u(z) = \log(z + 3000)\).

Problem 4.8 An interesting special case of Example 4.1 happens when
\(Z = \{-k, k\}\), \(k > 0\). As in that example, assume that \(\omega\) is the decision maker’s initial
wealth. The probability premium \(p(\omega, k)\) of a is defined as the difference \(a(k) - a(-k)\)
which makes the decision maker indifferent between the status quo and a risk z in
\(\{-k, k\}\). Prove that \(\lambda(\omega)\) is twice the probability premium per unit risked for small
risks (Pratt 1964).

Problem 4.9 The standard gamble technique has to be slightly modified for chronic
states considered worse than death as shown in Figure 4.7. Show that the utility of
the chronic state is \(u(z) = -\alpha/(1 - \alpha)\). Moreover, show that the quality-adjusted life
year is given by \(q = x/(x - t)\).

Figure 4.7 Modified standard gamble for chronic states considered worse than
death. (Diagram: a sure thing yielding death with probability 1.0, compared with a
gamble.)

Problem 4.10 A machine can be functioning (F), in repair (R), or dead (D). Under
the normal course of operations, the probability of making a transition between any
of the three states, in a day, is given by this table:

              TO
         F      R      D
     F   0.92   0.06   0.02
FROM R   0.45   0.45   0.10
     D   0.00   0.00   1.0

These probabilities only depend on where the machine is today, and not on its
past history of repairs, age, etc. This is obviously an unreasonable assumption; if
you want to relax it, be our guest.

When the machine functions, the daily income is $1000. When it is in repair, the 
daily cost is $150. There is no income after the machine is dead. Suppose you can 
put in place a regimen that gives a lower daily income ($900), but also decreases 
the probability that the machine needs repair or dies. Specifically, the transition table 
under the new regimen is 

                 TO

            F      R      D
       F   0.96   0.03   0.01

FROM   R   0.45   0.45   0.10

       D   0.00   0.00   1.0


New regimen or old? Assume the decision maker is risk neutral and that the 
machine is functioning today. You can use an analytic approach or a simulation. 
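
One possible analytic route is sketched below, under assumptions that the problem leaves open: income is not discounted, each day's amount depends only on the state the machine occupies that day (today included), and the $150 daily repair cost is unchanged under the new regimen. Writing v_F and v_R for the expected total income collected from states F and R until the machine dies, the Markov property gives v = r + Qv, where Q holds the transition probabilities among the two non-dead states and r the daily amounts. The function name and argument layout are illustrative only.

```python
def expected_total_income(p_ff, p_fr, p_rf, p_rr, income_f, income_r):
    """Expected total income until the machine dies, starting from F and from R.

    Solves (I - Q) v = r for the 2x2 block Q = [[p_ff, p_fr], [p_rf, p_rr]]
    of transitions among non-dead states, with daily amounts
    r = (income_f, income_r). The dead state is absorbing and pays nothing.
    """
    a, b = 1.0 - p_ff, -p_fr
    c, d = -p_rf, 1.0 - p_rr
    det = a * d - b * c
    v_f = (d * income_f - b * income_r) / det
    v_r = (a * income_r - c * income_f) / det
    return v_f, v_r


# ASSUMPTION: the repair cost stays at $150 per day under both regimens.
old_f, old_r = expected_total_income(0.92, 0.06, 0.45, 0.45, 1000.0, -150.0)
new_f, new_r = expected_total_income(0.96, 0.03, 0.45, 0.45, 900.0, -150.0)

print(f"old regimen, machine functioning today: {old_f:,.0f}")
print(f"new regimen, machine functioning today: {new_f:,.0f}")
```

Under these assumptions a risk-neutral decision maker only needs to compare the two values starting from F. A simulation of daily transitions should agree with them up to Monte Carlo error, and is the easier route if you relax the Markov assumption.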

Problem 4.11 Describe how you would approach the problem of estimating the 
cost-effectiveness of regulating emission of pollutants into the environment. Pick 
a real or hypothetical pollutant/regulation combination, and briefly describe data 
collection, modeling, and utility elicitation. 
Ramsey and Savage


In the previous two chapters we have studied the von Neumann-Morgenstern 
expected utility theory (NM theory) where all uncertainties are represented by objective
probabilities. Our ultimate goal, however, is to understand problems where
decision makers have to deal with both the utility of outcomes, as in the NM theory, 
and the probabilities of unknown states of the world, as in de Finetti’s coherence 
theory. 

For an example, suppose your roommate is to decide between two bets: bet (i) gives $5
if Duke and not UNC wins this year’s NCAA final, and $0 otherwise; bet (ii) gives $10
if a fair die that I toss comes up 3 or greater. The NM theory is not rich
enough to help in deciding which lottery to choose. If we agree that the die is fair, 
lottery (ii) can be thought of as a NM lottery, but what about lottery (i)? Say your 
roommate prefers lottery (ii) to lottery (i). Is it because of the rewards or because of 
the probabilities of the two events? Your roommate could be almost certain that Duke 
will win and yet prefer (ii) because he or she desperately needs the extra $5. Or your 
roommate could believe that Duke does not have enough of a chance of winning. 

Generally, based on agents’ preferences, it is difficult to understand their probability
without also considering their utility, and vice versa. “The difficulty is like that
of separating two different co-operating forces” (Ramsey 1926, p. 172). However,
several axiom systems exist that achieve this. The key is to try to hold utility
considerations constant while constructing a probability, and vice versa. For example, you
can ask your roommate whether he or she would prefer lottery (i) to (iii), where you 
win $0 if Duke wins this year’s NCAA championship, and you win $5 otherwise. If 
your roommate is indifferent then it has to be that he or she considers Duke winning 
and not winning to be equally likely. You can then use this as though it were a NM 
lottery in constructing a utility function for different sums of money. 


Decision Theory: Principles and Approaches G. Parmigiani, L. Y. T. Inoue 
© 2009 John Wiley & Sons, Ltd 
In a fundamental publication titled Truth and probability, written in 1926 and
published posthumously in 1931, Frank Plumpton Ramsey developed the first axiom 
system on preferences that yields an expected utility representation based on unique 
subjective probabilities and utilities (Ramsey 1926). A full formal development of 
Ramsey’s outline had to wait until the book The foundations of statistics, by Leonard 
J. Savage, “The most brilliant axiomatic theory of utility ever developed” (Fishburn 
1970). After summarizing Ramsey’s approach, in this chapter we give a very brief
overview of the first five chapters of Savage’s book. As his book’s title implies,
Savage set out to lay the foundation for statistics as a whole, not just for decision
making under uncertainty. As he points out in the 1971 preface to the second edition of
his book:


The original aim of the second part of the book... is... a personalistic
justification... for the popular body of devices developed by the
enthusiastically frequentist schools that then occupied almost the whole
statistical scene, and still dominate it, though less completely. The second
part of this book is indeed devoted to personalistic discussion of
frequentist devices, but for one after the other it reluctantly admits that
justification has not been found. (Savage 1972, p. iv)


We will return to this topic in Chapter 7, when discussing the motivation for the 
minimax approach. 

Featured readings: 

Ramsey, F. (1926). The Foundations of Mathematics, Routledge & Kegan Paul, 
London, chapter Truth and Probability, pp. 156-211. 

Savage, L. J. (1954). The foundations of statistics, John Wiley & Sons, Inc., New 
York. 


Critical reviews of Savage’s work include Fishburn (1970), Fishburn (1986), 
Shafer (1986), Lindley (1980), Dreze (1987), and Kreps (1988). A memorial collection
of writings also includes biographical materials and transcriptions of several
of Savage’s very enjoyable lectures (Savage 1981b). 


5.1 Ramsey’s theory 

In this section we review Ramsey’s theory, following Muliere & Parmigiani (1993). 
A more extensive discussion is in Fishburn (1981). Before we present the technical 
development, it is useful to revisit some of Ramsey’s own presentation. In very few 
pages, Ramsey not only laid out the game plan for theories that would take decades
to unfold and are still at the foundation of our field, but also anticipated many of the
key difficulties we still grapple with today: 

The old-established way of measuring a person’s belief is to propose a 
bet, and see what are the lowest odds that he will accept. This method 
I regard as fundamentally sound, but it suffers from being insufficiently 
general and from being necessarily inexact. It is inexact, partly because 
of the diminishing return of money, partly because the person may have 
a special eagerness or reluctance to bet, because he either enjoys or 
dislikes excitement or for any other reason, e.g. to make a book. The 
difficulty is like that of separating two different co-operating forces. 
Besides, the proposal of a bet may inevitably alter his state of opinion;
just as we could not always measure electric intensity by actually
introducing a charge and seeing what force it was subject to, because the 
introduction of the charge would change the distribution to be measured. 

In order therefore to construct a theory of quantities of belief which 
shall be both general and more exact, I propose to take as a basis a 
general psychological theory, which is now universally discarded, but 
nevertheless comes, I think, fairly close to the truth in the sort of cases 
with which we are most concerned. I mean the theory that we act in the 
way we think most likely to realize the objects of our desires, so that a 
person’s actions are completely determined by his desires and opinions. 

This theory cannot be made adequate to all the facts, but it seems to me 
a useful approximation to the truth particularly in the case of our
self-conscious or professional life, and it is presupposed in a great deal of
our thought. It is a simple theory and one which many psychologists 
would obviously like to preserve by introducing unconscious desires 
and unconscious opinions in order to bring it more into harmony with 
the facts. How far such fictions can achieve the required result I do not 
attempt to judge: I only claim for what follows approximate truth, or 
truth in relation to this artificial system of psychology, which like
Newtonian mechanics can, I think, still be profitably used even though it is
known to be false. (Ramsey 1926, pp. 172-173) 

Ramsey develops his theory for outcomes that are numerically measurable and
additive. Outcomes are not monetary nor necessarily numerical, but each outcome
is assumed to carry a value. This is a serious restriction and differs significantly
from NM theory and, later, Savage, but it makes sense for Ramsey, whose primary
goal was to develop a logic of the probable, rather than a general theory of rational
behavior. The set of outcomes is $Z$. As before we consider a subject that has a weak
order on outcome values. Outcomes in the same equivalence class with respect to
this order are indicated by $z_1 \sim z_2$, while strict preference is indicated by $\succ$.

The general strategy of Ramsey is: first, find a neutral proposition with subjective 
probability of one-half, then use this to determine a real-valued utility of outcomes, 
and finally, use the constructed utility function to measure subjective probability. 
Ramsey considers an agent making choices between options of the form “$z_1$ if $\theta$
is true, $z_2$ if $\theta$ is not true.” We indicate such an option by

$$a(\theta) = \begin{cases} z_1 & \text{if } \theta \\ z_2 & \text{if not } \theta. \end{cases} \tag{5.1}$$

The outcome $z$ and the option

$$\begin{cases} z & \text{if } \theta \\ z & \text{if not } \theta \end{cases} \tag{5.2}$$

belong to the same equivalence class, a property only implicitly assumed by Ramsey,
but very important in this context as it represents the equivalent of reflexivity.

In Ramsey’s definition, an ethically neutral proposition is one whose truth or
falsity is not “an object of desire to the subject.” More precisely, a proposition
$\theta_0$ is ethically neutral if two possible worlds differing only by the truth of
$\theta_0$ are equally desirable. Next Ramsey defines an ethically neutral proposition
with probability $1/2$ based on a simple symmetry argument: $\pi(\theta_0) = 1/2$ if for
every pair of outcomes $(z_1, z_2)$,

$$a(\theta_0) = \begin{cases} z_1 & \text{if } \theta_0 \\ z_2 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_1 & \text{if not } \theta_0. \end{cases} \tag{5.3}$$



The existence of one such proposition is postulated as an axiom: 


Axiom R1 There is an ethically neutral proposition believed to degree 1/2. 

The next axiom states that preferences among outcomes do not depend on which
ethically neutral proposition with probability $1/2$ is chosen. There is no loss of
generality in taking $\pi(\theta_0) = 1/2$: the same construction could have been
performed with a proposition of arbitrary probability, as long as such probability
could be measured based solely on preferences.


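Axiom R2 The indifference relation (5.3) still holds if we replace $\theta_0$ with any
other ethically neutral event $\theta_0'$.

Axiom R2a If $\theta_0$ is an ethically neutral proposition with probability $1/2$, we
have that if

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_4 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_3 & \text{if not } \theta_0 \end{cases}$$

then $z_1 \succ z_2$ if and only if $z_3 \succ z_4$, and $z_1 \sim z_2$ if and only if $z_3 \sim z_4$.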

Axiom R3 The indifference relation between actions is transitive. 

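Axiom R4 If

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_4 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_3 & \text{if not } \theta_0 \end{cases}$$

and

$$\begin{cases} z_3 & \text{if } \theta_0 \\ z_5 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_4 & \text{if } \theta_0 \\ z_6 & \text{if not } \theta_0 \end{cases}$$

then

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_5 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_6 & \text{if not } \theta_0. \end{cases}$$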


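Axiom R5 For every $z_1, z_2, z_3$, there exists a unique outcome $z$ such that

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_2 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z & \text{if } \theta_0 \\ z_3 & \text{if not } \theta_0. \end{cases}$$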


Axiom R6 For every pair of outcomes $(z_1, z_2)$, there exists a unique outcome $z$
such that

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_2 & \text{if not } \theta_0 \end{cases} \;\sim\; \chi_z.$$

This is a fundamental assumption: as $\chi_z \sim z$, this implies that there is a unique
certainty equivalent to $a$. We indicate this by $z^*(a)$.

Axiom R7 Axiom of continuity: Any progression has a limit (ordinal). 

Axiom R8 Axiom of Archimedes. 

Ramsey’s original paper provides little explanation regarding the last two axioms, 
which are reported verbatim here. Their role is to make the space of outcomes rich 
enough to be one-to-one with the real numbers. Sahlin (1990) suggests that continuity
should be the analogue of the standard completeness axiom of real numbers.
Then it would read like “every bounded set of outcomes has a least upper bound.” 
Here the ordering is given by preferences. 

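We can imagine formalizing the Archimedean axiom as follows: for every
$z_1 \succ z_2 \succ z_3$, there exist $\underline{z}$ and $\bar{z}$ such that

$$\begin{cases} \underline{z} & \text{if } \theta_0 \\ z_2 & \text{if not } \theta_0 \end{cases} \;\prec\; z_3$$

and

$$\begin{cases} \bar{z} & \text{if } \theta_0 \\ z_2 & \text{if not } \theta_0 \end{cases} \;\succ\; z_1.$$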


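Axioms R1-R8 are sufficient to guarantee the existence of a real-valued, one-to-one,
utility function on outcomes, designated by $u$, such that

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_4 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_3 & \text{if not } \theta_0 \end{cases} \iff u(z_1) - u(z_2) = u(z_3) - u(z_4). \tag{5.4}$$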
Ramsey did not complete the article before his untimely death at 26, and the
posthumously published draft does not include a complete proof. Debreu (1959) and
Pfanzagl (1959), among others, designed more formal axioms based on which they
obtain results similar to Ramsey’s.

In (5.4), an individual’s preferences are represented by a continuous and strictly
monotone utility function $u$, determined up to a positive affine transformation.
Continuity stems from Axioms R2a and R6. In particular, consistent with the principle
of expected utility,

$$\begin{cases} z_1 & \text{if } \theta_0 \\ z_4 & \text{if not } \theta_0 \end{cases} \;\sim\; \begin{cases} z_2 & \text{if } \theta_0 \\ z_3 & \text{if not } \theta_0 \end{cases} \iff \frac{u(z_1) + u(z_4)}{2} = \frac{u(z_2) + u(z_3)}{2}. \tag{5.5}$$



Having defined a way of measuring utility, Ramsey can now derive a way of
measuring probability for events other than the ethically neutral ones. Paraphrasing
closely his original explanation, if the option of $z^*$ for certain is indifferent with
that of $z_1$ if $\theta$ is true and $z_2$ if $\theta$ is false, we can define the
subject’s probability of $\theta$ as the ratio of the difference between the utilities
of $z^*$ and $z_2$ to that between the utilities of $z_1$ and $z_2$, which we must
suppose the same for all the $z^*$, $z_1$, and $z_2$ that satisfy the indifference
condition. This amounts roughly to defining the degree of belief in $\theta$ by the odds
at which the subject would bet on $\theta$, the bet being conducted in terms of
differences of value as defined.
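
Written as a formula, and assuming the expected utility representation that the axioms are meant to deliver, the indifference between $z^*$ for certain and the option paying $z_1$ if $\theta$ and $z_2$ otherwise can be solved for the probability. The symbol $\pi(\theta)$ for the subject's probability follows the notation used for the ethically neutral proposition; the algebra is only a sketch of the verbal definition.

```latex
\begin{align*}
u(z^*) &= \pi(\theta)\, u(z_1) + \bigl(1 - \pi(\theta)\bigr)\, u(z_2), \\
\pi(\theta) &= \frac{u(z^*) - u(z_2)}{u(z_1) - u(z_2)}, \qquad u(z_1) \neq u(z_2).
\end{align*}
```

For instance, with hypothetical values $u(z_1) = 10$, $u(z_2) = 0$ and $u(z^*) = 7$, the subject's probability of $\theta$ would be $0.7$.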

Ramsey also proposed a definition for conditional probabilities of $\theta_1$, given that
$\theta_2$ occurred, based on the indifference between the following actions:

$$\begin{cases} z_1 & \text{if } \theta_2 \\ z_2 & \text{if not } \theta_2 \end{cases} \quad \text{versus} \quad \begin{cases} z_3 & \text{if } \theta_2 \text{ and } \theta_1 \\ z_4 & \text{if } \theta_2 \text{ and not } \theta_1 \\ z_2 & \text{if not } \theta_2. \end{cases} \tag{5.6}$$

Similarly to what we did with called-off bets, the conditional probability is the ratio
of the difference between the utilities of $z_1$ and $z_4$ to that between the utilities
of $z_3$ and $z_4$, which we must again suppose the same for any set of $z$ that satisfies
the indifference condition above. Interestingly, Ramsey observes:


This is not the same as the degree to which he would believe $\theta_1$, if
he believed $\theta_2$ for certain; for knowledge of $\theta_2$ might for
psychological reasons profoundly alter his whole system of beliefs.
(Ramsey 1926, p. 180)


This comment is an ancestor of the concern we discussed in Section 2.2, about 
temporal coherence. We will return to this issue in Section 5.2.3. 
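
Returning to the ratio that defines the conditional probability through (5.6), it can be checked against the expected utility representation. This is a sketch, writing $\pi$ for the subject's probability and assuming $\pi(\theta_2) > 0$ and $u(z_3) \neq u(z_4)$. The term in $z_2$, received when $\theta_2$ fails, is common to both options and cancels.

```latex
\begin{align*}
\pi(\theta_2)\, u(z_1)
  &= \pi(\theta_1 \wedge \theta_2)\, u(z_3)
   + \bigl[\pi(\theta_2) - \pi(\theta_1 \wedge \theta_2)\bigr]\, u(z_4), \\
\frac{\pi(\theta_1 \wedge \theta_2)}{\pi(\theta_2)}
  &= \frac{u(z_1) - u(z_4)}{u(z_3) - u(z_4)}.
\end{align*}
```

The left-hand side of the second line is the standard definition of the conditional probability of $\theta_1$ given $\theta_2$, so the ratio of utility differences recovers it.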
Finally, he proceeds to prove that the probabilities so derived satisfy what he calls
the fundamental laws, that is the Kolmogorov axioms and the standard definition of 
conditional probability. 


5.2 Savage’s theory 

5.2.1 Notation and overview 

We are now ready to approach Savage’s version of this story. Compared to the
formulations we encountered so far, Savage takes a much higher road in terms
of the mathematical generality, dealing with continuous variables and general
outcomes and parameter spaces. Let $Z$ denote the set of outcomes (or consequences, or
rewards) with a generic element $z$. Savage described an outcome as “a list of answers
to all the questions that might be pertinent to the decision situation at hand” (Savage
1981a). As before, $\Theta$ is the set of possible states of the world. Its generic
element is $\theta$. An act or action is a function $a : \Theta \to Z$ from states to
outcomes. Thus, $a(\theta)$ is the consequence of taking action $a$ if the state of the
world turns out to be $\theta$. The set of all acts is $A$. Figure 5.1 illustrates the
concept. This is a good place to point out that it is not in general obvious how well
the consequences can be separated from the states of the world. This difficulty
motivates most of the discussion of Chapter 6 so we will postpone it for now.


ACT 1: Take umbrella

ACT 2: Leave umbrella at home

$\Theta$ = {“rain”, “shine”} and $Z$ = {“dry”, “wet”, “dry, carry umbrella”}

Figure 5.1 Two acts for modeling the decision of whether or not to take an umbrella
to work.
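
A minimal sketch of these objects in code, for the umbrella example: states and outcomes are strings, and an act is a mapping from states to outcomes. The specific assignment of outcomes to states below is our reading of Figure 5.1 rather than something spelled out in the text, and all names are illustrative.

```python
# States of the world and outcomes, as listed with Figure 5.1.
STATES = ("rain", "shine")
OUTCOMES = ("dry", "wet", "dry, carry umbrella")

# An act is a function from states to outcomes; on a finite state space
# a dict is enough. ASSUMPTION: this assignment is our reading of the figure.
take_umbrella = {"rain": "dry, carry umbrella", "shine": "dry, carry umbrella"}
leave_umbrella = {"rain": "wet", "shine": "dry"}


def is_act(a):
    """Check that `a` assigns an outcome in Z to every state in Theta."""
    return set(a) == set(STATES) and all(z in OUTCOMES for z in a.values())


def consequence(a, theta):
    """a(theta): the consequence of act `a` when the state turns out to be theta."""
    return a[theta]


print(consequence(take_umbrella, "rain"))
print(consequence(leave_umbrella, "rain"))
```

Note that nothing in this representation involves probabilities or utilities yet: in Savage's setup both have to be derived from preferences over such acts.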
Before we dive into the axioms, we outline the general plan. It has points in
common with Ramsey’s except it can count on the NM theory and maneuvers to 
construct the probability first, to be able to leverage it in constructing utilities: 

1. Use preferences over acts to define a “more probable than” relationship among 
states of the world (Axioms 1-5). 

2. Use this to construct a probability measure on the states of the world (Axioms
1-6).

3. Use the NM theory to get an expected utility representation. 

One last reminder: there are no compound acts (a rather artificial, but very powerful 
construct) and no physical chance mechanisms. All probabilities are subjective and 
will be derived from preferences, so we will have some hard work to do before we 
get to familiar places. 

We follow the presentation of Kreps (1988), and give the axioms in his same
form, though in slightly different order. When possible, we also note the name
the axiom has in Savage’s book: those are names like P1 and P2 (for
postulate). Full details of proofs are given in Savage (1954) and Fishburn (1970).

5.2.2 The sure thing principle 

We start, as usual, with a binary preference relation and a set of states that are not so
boring as to generate complete indifference.

Axiom S1 $\succ$ on $A$ is a preference relation (that is, the $\succ$ relation is
complete and transitive).

Axiom S2 There exist $z_1$ and $z_2$ in $Z$ such that $z_1 \succ z_2$.

These are Axioms P1 and P5 in Savage.

A cornerstone of Savage’s theory is the sure thing principle: 

A businessman contemplates buying a certain piece of property. He 
considers the outcome of the next presidential election relevant to the 
attractiveness of the purchase. So, to clarify the matter to himself, he 
asks whether he would buy if he knew that the Republican candidate 
were going to win, and decides that he would do so. Similarly, he considers
whether he would buy if he knew that the Democratic candidate
were going to win, and again finds that he would do so. Seeing that he 
would buy in either event, he decides that he should buy, even though he 
does not know which event obtains, or will obtain, as we would ordinarily
say. It is all too seldom that a decision can be arrived at on the basis
of the principle used by this businessman, but, except possibly for the
assumption of simple ordering, I know of no other extralogical principle
governing decisions that finds such ready acceptance.

Having suggested what I will tentatively call the sure-thing principle,
let me give it relatively formal statement thus: If the person would not
prefer $a_1$ to $a_2$ either knowing that the event $\Theta_0$ obtained, or knowing
that the event $\Theta_0^c$ obtained, then he does not prefer $a_1$ to $a_2$.
(Savage 1954, p. 21, with notational changes)

Implementing this principle requires a few more steps, necessary to define more
rigorously what it means to “prefer $a_1$ to $a_2$ knowing that the event $\Theta_0$
obtained.” Keep in mind that the deck we are playing with is a set of preferences on
acts, and those are functions defined on the whole set of states $\Theta$. To start
thinking about conditional preferences, consider the acts shown in Figure 5.2. Acts
$a_1$ and $a_2$ are different from each other on the set $\Theta_0$, and are the same on
its complement $\Theta_0^c$. Say we prefer $a_1$ to $a_2$. Now look at $a_1'$ versus
$a_2'$. On $\Theta_0$ this is the same comparison we had before. On $\Theta_0^c$, $a_1'$
and $a_2'$ are again equal to each other, though their actual values have changed from
the previous comparison. Should we also prefer $a_1'$ to $a_2'$? Axiom S3 says yes: if
two acts are equal on a set (in this case $\Theta_0^c$), the preference between them is
not allowed to depend on the common value taken in that set.



Figure 5.2 The acts that are being compared in Axiom S3. Curves that are very 
close to one another and parallel are meant to be on top of each other, and are 
separated by a small vertical shift only so you can tell them apart. 
Axiom S3 Suppose that $a_1, a_1', a_2, a_2'$ in $A$ and $\Theta_0 \subset \Theta$ are
such that

(i) $a_1(\theta) = a_1'(\theta)$ and $a_2(\theta) = a_2'(\theta)$ for all $\theta$ in $\Theta_0$; and

(ii) $a_1(\theta) = a_2(\theta)$ and $a_1'(\theta) = a_2'(\theta)$ for all $\theta$ in $\Theta_0^c$.

Then $a_1 \succ a_2$ if and only if $a_1' \succ a_2'$.

This is Axiom P2 in Savage, and resembles the independence axiom NM2, except 
that here there are no compound acts and the mixing is done by the states of nature. 

Axiom S3 makes it possible to define the notion, fundamental to Savage’s theory,
of conditional preference. To define preference conditional on $\Theta_0$, we make the
two acts equal outside of $\Theta_0$ and then compare them. Because of Axiom S3, it does
not matter how they are equal outside of $\Theta_0$.

Definition 5.1 (Conditional preference) We say that $a_1 \succ a_2$ given $\Theta_0$ if
and only if $a_1' \succ a_2'$, where $a_1' = a_1$ and $a_2' = a_2$ on $\Theta_0$ and
$a_2' = a_1'$ on $\Theta_0^c$.

Figure 5.3 describes schematically this definition. The setting has the flavor of a
called-off bet, and it represents a preference stated before knowing whether $\Theta_0$
occurs or not. Problem 5.1 helps you tie this back to the informal definition of the
sure thing principle given in Savage’s quote. In particular, within the Savage axiom
system, if $a \succ a'$ on $\Theta_0$ and $a \succ a'$ on $\Theta_0^c$, then $a \succ a'$.

Figure 5.3 Schematic illustration of conditional preference. Acts $a_1'$ and $a_2'$ do
not necessarily have to be constant on $\Theta_0^c$. Again, curves that are very close
to one another and parallel are meant to be on top of each other, and are separated by
a small vertical shift only so you can tell them apart.
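
To see Axiom S3 and Definition 5.1 at work, the sketch below uses an agent whose preferences happen to be given by expected utility on a three-state space. The probabilities, utilities, acts, and function names are all hypothetical. The conditional comparison is made by overwriting both acts with a common outcome off the event, and the verdict does not depend on which common outcome is used.

```python
# Hypothetical expected-utility agent on a finite state space.
PROB = {"s1": 0.2, "s2": 0.3, "s3": 0.5}       # assumed subjective probabilities
UTIL = {"bad": 0.0, "ok": 1.0, "good": 2.0}   # assumed utilities of outcomes


def expected_utility(act):
    return sum(PROB[s] * UTIL[act[s]] for s in PROB)


def prefers(a1, a2):
    """Strict preference between acts (dicts from states to outcomes)."""
    return expected_utility(a1) > expected_utility(a2)


def prefers_given(a1, a2, event, fill):
    """Definition 5.1: keep the acts on `event`, set both to `fill` elsewhere."""
    b1 = {s: (a1[s] if s in event else fill) for s in PROB}
    b2 = {s: (a2[s] if s in event else fill) for s in PROB}
    return prefers(b1, b2)


a1 = {"s1": "good", "s2": "bad", "s3": "ok"}
a2 = {"s1": "ok", "s2": "ok", "s3": "good"}
event = {"s1", "s2"}

# Axiom S3: the verdict is the same whatever common outcome is used off the event.
verdicts = [prefers_given(a1, a2, event, fill) for fill in UTIL]
print(verdicts)
```

An expected-utility agent satisfies Axiom S3 automatically, because the common part of the two acts adds the same amount to both expected utilities. The axiom asks for the same invariance directly in terms of preferences, before any probability or utility is available.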

5.2.3 Conditional and a posteriori preferences 

In Sections 2.1.4 and 5.1 we discussed the correspondence between conditional 
probabilities, as derived in terms of called-off bets, and beliefs held after the
conditioning event is observed. A similar issue arises here when we consider preferences
between actions, given that the outcome of an event is known. This is a critical 
concern if we want to make a strong connection between axiomatic theories and 
statistical practice. Pratt et al. (1964) comment that while conditional preferences 
before and after the conditioning event is observed can reasonably be equated, the 
two reflect two different behavioral principles. Therefore, they suggest, an additional 
axiom is required. Restating their axiom in the context of Savage’s theory, it would 
read like: 

Before/after axiom The a posteriori preference $a_1 \succ a_2$ given knowledge that
$\Theta_0$ has occurred holds if and only if the conditional preference $a_1 \succ a_2$
given $\Theta_0$ holds.

Here the conditional preference could be defined according to Definition 5.1. This 
axiom is not part of Savage’s formal development. 

5.2.4 Subjective probability 

In this section we are going to focus on extracting a unique probability distribution
from preferences. We do this in stages, first defining qualitative probabilities,
or “more likely than” statements, and then imposing additional restrictions to derive
the quantitative version. Dealing with real-valued $\theta$ makes the analysis somewhat
complicated. We are showing the tip of a big iceberg here, but the submerged part
is secondary to the main story of our book. Real analysis aficionados will enjoy it,
though: Kreps (1988) has an entire chapter on it, and so does DeGroot (1970).
Interestingly, DeGroot uses a similar technical development, but assumes utilities and
probabilities (as opposed to preferences) as primitives in the theory, thus bypassing
entirely the need to axiomatize preferences.

The first definition we need is that of a null state. If asked to compare two acts 
conditional on a null state, the agent will always be indifferent. Null states will turn 
out to have a subjective probability of zero. 

Definition 5.2 (Null state) $\Theta_0 \subset \Theta$ is called null if $a \sim a'$ given
$\Theta_0$ for all $a, a'$ in $A$.

The next two axioms are making sure we can safely hold utility-like considerations
constant in teasing probabilities out of preferences. The first, Axiom S4:

is so couched as not only to assert that knowledge of an event cannot
establish a new preference among consequences or reverse an old
one, but also to assert that, if the event is not null, no preference among
consequences can be reduced to indifference by knowledge of an event.
(Savage 1954, p. 26)

Axiom S4 If:

(i) $\Theta_0$ is not null; and

(ii) $a(\theta) = z_1$ and $a'(\theta) = z_2$, for all $\theta \in \Theta_0$,

then $a \succ a'$ given $\Theta_0$ if and only if $z_1 \succ z_2$.

This is Axiom P3 in Savage. Figure 5.4 describes schematically this condition. 

Next, Axiom S5 goes back to the simple binary comparisons we used in the
preface of this chapter to understand your roommate’s probabilities. Say he or she
prefers a₁: $10 if Duke wins the NCAA championship and $0 otherwise to a₂: $10 if
UNC wins and $0 otherwise. Should your roommate’s preferences remain the same
if you change the $10 to $8 and the $0 to $1? Axiom S5 says yes, and it makes it a
lot easier to separate the probabilities from the utilities.



Figure 5.4 Schematic illustration of Axiom S4. 
Axiom S5 Suppose that $z_1, z_2, z_1', z_2' \in Z$, $a_1, a_1', a_2, a_2'$ in $A$, and
$\Theta_0, \Theta_1 \subset \Theta$ are such that:

(i) $z_1 \succ z_2$ and $z_1' \succ z_2'$;

(ii) $a_1(\theta) = z_1$ and $a_2(\theta) = z_1'$ on $\Theta_0$ and $a_1(\theta) = z_2$ and $a_2(\theta) = z_2'$ on $\Theta_0^c$;

(iii) $a_1'(\theta) = z_1$ and $a_2'(\theta) = z_1'$ on $\Theta_1$ and $a_1'(\theta) = z_2$ and $a_2'(\theta) = z_2'$ on $\Theta_1^c$;

then $a_1 \succ a_1'$ if and only if $a_2 \succ a_2'$.

Figure 5.5 describes schematically this axiom. This is Axiom P4 in Savage. 

We are now equipped to figure out, from a given set of preferences between 
acts, which of two events an agent considers more likely, using precisely the simple 
comparison considered in Axiom S5. 

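Definition 5.3 (More likely than) For any two events $\Theta_0, \Theta_1 \subset \Theta$,
we say that $\Theta_0$ is more likely than $\Theta_1$ (denoted by $\Theta_0 \succ \Theta_1$)
if for every $z_1, z_2$ such that $z_1 \succ z_2$ and $a, a'$ defined as

$$a(\theta) = \begin{cases} z_1 & \text{if } \theta \in \Theta_0 \\ z_2 & \text{if } \theta \in \Theta_0^c \end{cases} \qquad a'(\theta) = \begin{cases} z_1 & \text{if } \theta \in \Theta_1 \\ z_2 & \text{if } \theta \in \Theta_1^c \end{cases}$$

then $a \succ a'$.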



Figure 5.5 The acts in Axiom S5. 
This is really nothing but our earlier Duke/UNC example dressed up in style. For
your roommate, a Duke win must be more likely than a UNC win. 

If we tease out this type of comparison for enough sets, we can define what is 
called a qualitative probability. Formally: 

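Definition 5.4 (Qualitative probability) The binary relation $\succ$ between sets is a
qualitative probability whenever:

1. $\succ$ is asymmetric and negatively transitive;

2. $\Theta_0 \succeq \emptyset$ for all $\Theta_0$ (subset of $\Theta$);

3. $\Theta \succ \emptyset$; and

4. if $\Theta_0 \cap \Theta_2 = \Theta_1 \cap \Theta_2 = \emptyset$, then $\Theta_0 \succ \Theta_1$ if and only if $\Theta_0 \cup \Theta_2 \succ \Theta_1 \cup \Theta_2$.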

As you may expect, we have done enough work to get a qualitative probability 
out of the preferences: 

Theorem 5.1 If S1-S5 hold then the binary relation $\succ$ on $\Theta$ is a qualitative
probability.

Qualitative probabilities are big progress, but Savage is aiming for the good old
quantitative kind. It turns out that if you have a quantitative probability, and you
say that an event $\Theta_0$ is more likely than $\Theta_1$ if $\Theta_0$ has a larger
probability than $\Theta_1$, then you define a legitimate qualitative probability. In
general, however, the qualitative probabilities will not give you quantitative ones
without additional technical conditions that can actually be a bit strong.
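
The easy direction, from a quantitative probability to a qualitative one, can be checked by brute force on a small example. The sketch below uses a hypothetical probability on three states, with values chosen so that floating point sums are exact, and tests the four requirements of Definition 5.4 over all events. The weak comparison in requirement 2 is expressed as "the empty set is never strictly more likely".

```python
from itertools import chain, combinations

# ASSUMPTION: a hypothetical probability on a three-point state space.
PROB = {"s1": 0.5, "s2": 0.25, "s3": 0.25}
THETA = frozenset(PROB)
EMPTY = frozenset()


def subsets(states):
    items = list(states)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


def pi(event):
    return sum(PROB[s] for s in event)


def more_likely(e0, e1):
    """e0 is more likely than e1 when it has strictly larger probability."""
    return pi(e0) > pi(e1)


EVENTS = subsets(THETA)

asymmetric = all(not (more_likely(a, b) and more_likely(b, a))
                 for a in EVENTS for b in EVENTS)
# Negative transitivity: if a is not above b and b is not above c,
# then a is not above c.
neg_transitive = all(more_likely(a, b) or more_likely(b, c) or not more_likely(a, c)
                     for a in EVENTS for b in EVENTS for c in EVENTS)
nothing_below_empty = all(not more_likely(EMPTY, a) for a in EVENTS)
whole_above_empty = more_likely(THETA, EMPTY)
additive = all(more_likely(a, b) == more_likely(a | c, b | c)
               for a in EVENTS for b in EVENTS for c in EVENTS
               if not (a & c) and not (b & c))

print(asymmetric, neg_transitive, nothing_below_empty, whole_above_empty, additive)
```

The converse is the hard part: a relation can satisfy all four requirements without being representable by any probability, which is why further conditions are needed.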

One way to get more quantitative is to be able to split up the set $\Theta$ into an
arbitrarily large number of equivalent subsets. Then the quantitative probability of 
each of these must be the inverse of this number. De Finetti (1937) took this route 
for subjective probabilities, for example. Savage steps in this direction somewhat 
reluctantly: 

It may fairly be objected that such a postulate would be flagrantly ad hoc. 

On the other hand, such a postulate could be made relatively acceptable 
by observing that it will obtain if, for example, in all the world there is 
a coin that the person is firmly convinced is fair, that is, a coin such that 
any finite sequence of heads and tails is for him no more probable than 
any other sequence of the same length; though such a coin is, to be sure, 
a considerable idealization. (Savage 1954, p. 33) 

To follow this thought one more step, we could impose the following sort of
condition: given any two sets, you can find a partition fine enough that you can take
the less likely of the two sets, and merge it with any element of the partition without
altering the fact that it is less likely. More formally:


Definition 5.5 (Finite partition condition) If $\Theta_0$, $\Theta_1$ in $\Theta$ are
such that $\Theta_0 \succ \Theta_1$, then there exists a finite partition
$\{E_1, \ldots, E_n\}$ of $\Theta$ such that $\Theta_0 \succ \Theta_1 \cup E_k$, for every
$k = 1, \ldots, n$.


This would get us to the finish line as far as the probability part is concerned,
because we could prove the following. 


Theorem 5.2 The relationship $\succ$ is a qualitative probability and satisfies the
axiom on the existence of a finite partition if and only if there exists a unique
probability measure $\pi$ such that


(i) $\Theta_0 \succ \Theta_1$ if and only if $\pi(\Theta_0) > \pi(\Theta_1)$;

(ii) for all $\Theta_0 \subset \Theta$ and $k \in [0, 1]$, there exists $\Theta_1 \subset \Theta_0$, such that $\pi(\Theta_1) = k\pi(\Theta_0)$.


See Kreps (1988) for details. 

Savage approaches this slightly differently and embeds the finite partition condition
into his Archimedean axiom, so that it is stated directly in terms of preferences,
a much more attractive approach from the point of view of foundations than imposing
restrictions on the qualitative probabilities directly. The requirement is that you can
split up the set $\Theta$ in small enough pieces so that your preference will be
unaffected by an arbitrary change of consequences within any one of the pieces.


Axiom S6 (Archimedean axiom) For all $a_1, a_2$ in $A$ such that $a_1 \succ a_2$ and
$z \in Z$, there exists a finite partition of $\Theta$ such that for all $\Theta_0$ in
the partition:


(i) $a_1'(\theta) = z$ for $\theta \in \Theta_0$ and $a_1'(\theta) = a_1(\theta)$ for $\theta \in \Theta_0^c$; then $a_1' \succ a_2$; or

(ii) $a_2'(\theta) = z$ for $\theta \in \Theta_0$ and $a_2'(\theta) = a_2(\theta)$ for $\theta \in \Theta_0^c$; then $a_1 \succ a_2'$.


This is P6 in Savage. Two important consequences of this are that there are no
consequences that are infinitely better or worse than others (similarly to NM3), and also
that the set of states of nature is rich enough that it can be split up into very tiny
pieces, as required by the finite partition condition. A discrete space may not satisfy 
this axiom for all acts. 
It can in fact be shown that in the context of this axiom system the Archimedean
condition above guarantees that the finite partition condition is met:

Theorem 5.3 If S1-S5 and S6 hold then $\succ$ satisfies the axiom on the existence of a
finite partition of $\Theta$.

It then follows from Theorem 5.2 that the above gives a necessary and sufficient 
condition for a unique probability representation. 

5.2.5 Utility and expected utility 

Now that we have a unique $\pi$ on $\Theta$, and the ability to play with very fine
partitions of $\Theta$, we can follow the script of the NM theory to derive utilities
and a representation of preferences. To be sure, one more axiom is needed, to require
that if you prefer $a$ to having any consequence of $a'$ for sure, then you prefer $a$
to $a'$.

Axiom S7 For all $\Theta_0 \subset \Theta$:

(i) If $a \succ a'(\theta)$ given $\Theta_0$ for all $\theta$ in $\Theta_0$, then $a \succ a'$ given $\Theta_0$.

(ii) If $a'(\theta) \succ a$ given $\Theta_0$ for all $\theta$ in $\Theta_0$, then $a' \succ a$ given $\Theta_0$.

This is Axiom P7 in Savage, and does a lot of the work needed to apply the theory 
to general consequences. Similar conditions, for example, can be used to generalize 
the NM theory beyond the finite Z we discussed in Chapter 3. 

We have finally laid out all the conditions for an expected utility representation.

Theorem 5.4 Axioms S1-S7 are sufficient for the following conclusions:

(a) \(\succ\) as defined above is a qualitative probability, and there exists a unique
probability measure \(\pi\) on \(\Theta\) such that \(\Theta_0 \succ \Theta_1\) if and only if \(\pi(\Theta_0) > \pi(\Theta_1)\).

(b) For all \(\Theta_0 \subseteq \Theta\) and \(k \in [0,1]\), there exists a subset \(\Theta_1\) of \(\Theta_0\) such that
\(\pi(\Theta_1) = k\pi(\Theta_0)\).

(c) For \(\pi\) given above, there is a bounded utility function \(u : Z \to \Re\) such that
\(a \succ a'\) if and only if

\[
U_\pi(a) = \int_\Theta u(a(\theta))\pi(\theta)\,d\theta > \int_\Theta u(a'(\theta))\pi(\theta)\,d\theta = U_\pi(a').
\]

Moreover, this \(u\) is unique up to a positive affine transformation.

When the space of consequences is denumerable, we can rewrite the expected
utility of action \(a\) in a form that better highlights the parallel with the NM theory.
Defining

\[
p(z) = \pi\{\theta : a(\theta) = z\}
\]

we have the familiar

\[
U_\pi(a) = \sum_z u(z)p(z)
\]

except now \(p\) reflects a subjective opinion and not a controlled chance mechanism.
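
As a small illustration of this rewriting, the sketch below computes the induced distribution \(p(z)\) and the resulting expected utility for a finite state space. The state labels, the probabilities, the act, and the linear utility are all hypothetical choices made for the example, and the function names are ours.

```python
from collections import defaultdict
from fractions import Fraction


def induced_lottery(act, pi):
    """Return p(z) = pi{theta : act(theta) = z} as a dictionary keyed by consequence."""
    p = defaultdict(Fraction)
    for theta, prob in pi.items():
        p[act[theta]] += prob
    return dict(p)


def expected_utility(act, pi, u):
    """Expected utility written in the NM-like form sum_z u(z) p(z)."""
    p = induced_lottery(act, pi)
    return sum(u(z) * pz for z, pz in p.items())


# Hypothetical personal probabilities on three states.
pi = {"s1": Fraction(1, 2), "s2": Fraction(1, 4), "s3": Fraction(1, 4)}
# Hypothetical act: consequence 0 in state s1, consequence 10 otherwise.
act = {"s1": 0, "s2": 10, "s3": 10}


def u(z):
    # Assumed linear utility, used only for illustration.
    return z


p = induced_lottery(act, pi)
eu = expected_utility(act, pi, u)
print(p, eu)
```

The two states that lead to the same consequence are pooled, so the act is evaluated exactly like an NM lottery over consequences, with the difference that the weights come from \(\pi\).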

Wakker (1993) discusses generalizations of this theorem to unbounded utility. 

5.3 Allais revisited 

In Section 3.5 we discussed Allais’ claim that sober decision makers can still violate
the NM2 axiom of independence. The same example can be couched in terms of
Savage’s theory. Imagine that the lotteries of Figure 3.3 are based on drawing at
random an integer number \(\theta\) between 1 and 100. Then for any decision maker that
trusts the drawing mechanism to be random \(\pi(\theta) = 0.01\). The lotteries can then be
rewritten as shown in Table 5.1.

This representation is now close to Savage’s sure thing principle (Axiom S3). 
If a number between 12 and 100 is drawn, it will not matter, in either situation, 
which lottery is chosen. So the comparisons should be based on what happens if a 
number between 1 and 11 is drawn. And then the comparison is the same in both 
situations. Savage thought that this conclusion “has a claim to universality, or
objectivity” (Savage 1954). Allais obviously would disagree. If you want to know more
about this debate, which is still going on now, good references are Gärdenfors and
Sahlin (1988), Kahneman et al. (1982), Seidenfeld (1988), Kahneman and Tversky
(1979), Tversky (1974), Fishburn (1982), and Kreps (1988).

While this example is generally understood to be a challenge to the sure
thing principle, we would like to suggest that there also is a connection with the
“before/after axiom” of Section 5.2.3. This axiom requires that the preferences for
two acts, after one learns that \(\theta\) is for sure 11 or less, should be the same as those
expressed if the two actions are made equal in all cases where \(\theta\) is 12 or more,
irrespective of what the common outcome is. There are individuals who, like Allais,
are reversing the preferences depending on the outcome offered if \(\theta\) is 12 or more.
But we suppose that the same individuals would agree that \(a \succ a'\) if and only if
\(b \succ b'\) once it is known for sure that \(\theta\) is 11 or less. The two comparisons are now
identical for sure! So it seems that agents such as those described by Allais would
also be likely to violate the “before/after axiom.”


Table 5.1 Representation of the lotteries of the Allais paradox in terms
of Savage’s theory. Rewards are in units of $100 000. Probabilities of the
three columns are respectively 0.01, 0.10, and 0.89.

        \(\theta = 1\)    \(2 \le \theta \le 11\)    \(12 \le \theta \le 100\)
a          5              5                  5
a'         0             25                  5
b          5              5                  0
b'         0             25                  0
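
The sure thing argument can be checked numerically from Table 5.1. The sketch below computes the expected utility gap between \(a\) and \(a'\) and between \(b\) and \(b'\) under three utility functions that we picked arbitrarily for illustration (they are not from the text). Since the two pairs differ only in the common third column, the two gaps coincide whatever the utility.

```python
from fractions import Fraction

# Probabilities of the three columns of Table 5.1.
probs = [Fraction(1, 100), Fraction(10, 100), Fraction(89, 100)]

# Rewards (units of $100 000) of each act in the three columns.
acts = {
    "a": [5, 5, 5],
    "a'": [0, 25, 5],
    "b": [5, 5, 0],
    "b'": [0, 25, 0],
}


def expected_utility(payoffs, u):
    """Expected utility of an act given column probabilities and a utility u."""
    return sum(p * u(z) for p, z in zip(probs, payoffs))


def gap(first, second, u):
    """Expected utility of `first` minus expected utility of `second`."""
    return expected_utility(acts[first], u) - expected_utility(acts[second], u)


# Hypothetical utilities, chosen only to show that the conclusion does not depend on u.
utilities = {
    "linear": lambda z: Fraction(z),
    "concave": lambda z: Fraction(z, z + 1),
    "convex": lambda z: Fraction(z * z),
}

gaps = {name: (gap("a", "a'", u), gap("b", "b'", u)) for name, u in utilities.items()}
for name, (gap_a, gap_b) in gaps.items():
    print(name, gap_a, gap_b)
```

An expected utility maximizer therefore ranks \(a\) against \(a'\) the same way as \(b\) against \(b'\); the preference reversal described by Allais cannot be reproduced by any choice of \(u\) in this representation.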


5.4 Ellsberg paradox 

The next example is due to Ellsberg. Suppose that we have an urn with 300 balls in
it: 100 of these balls are red (R) and the rest are either blue (B) or yellow (Y). As in
the Allais example, we will consider two pairs of actions and we will have to choose
between lotteries \(a\) and \(a'\), and then between lotteries \(b\) and \(b'\). Lotteries are
depicted in Table 5.2.

Suppose a decision maker expresses a preference for \(a\) over \(a'\). In action \(a\) the
probability of winning is 1/3. We may conclude that the decision maker considers
blue less likely than yellow, and therefore believes that the chance of winning the
prize is higher in \(a\). In fact, this is only slightly more complicated than the sort of
comparison that is used in Savage’s theory to construct the qualitative probabilities.

What Ellsberg observed is that this is not necessarily the case. Many of the same
decision makers also prefer \(b'\) to \(b\). In \(b'\) the probability of winning the prize is 2/3.
If one thought that blue is less likely than yellow, it would follow that the probability
of winning the prize in lottery \(b\) is more than 2/3, and thus that \(b\) is to be preferred
to \(b'\).

In fact, the observed preferences violate Axiom S3 in Savage’s theory. The
actions are such that

\[
\begin{aligned}
a(R) &= b(R) &&\text{and} & a'(R) &= b'(R) \\
a(B) &= b(B) &&\text{and} & a'(B) &= b'(B) \\
a(Y) &= a'(Y) &&\text{and} & b(Y) &= b'(Y).
\end{aligned}
\]

So we are again in the realm of Axiom S3, and it should be that if \(a \succ a'\), then \(b \succ b'\).
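
The incompatibility can also be seen by brute force. The sketch below, which is only an illustration, runs through every possible number of blue balls in the urn described above, treats the corresponding proportions as the decision maker's probabilities, and looks for a belief under which expected utility ranks \(a\) above \(a'\) and \(b'\) above \(b\). The function name and the normalization \(u(1000) = 1\), \(u(0) = 0\) are our assumptions.

```python
from fractions import Fraction


def rationalizing_beliefs(u_win=1, u_lose=0, balls=300, red=100):
    """Blue-ball counts for which expected utility gives a > a' and b' > b."""
    found = []
    for blue in range(balls - red + 1):
        pi = {
            "R": Fraction(red, balls),
            "B": Fraction(blue, balls),
            "Y": Fraction(balls - red - blue, balls),
        }

        def eu(winning_colors):
            # Expected utility of an act paying the prize on the given colors.
            return sum(pi[s] * (u_win if s in winning_colors else u_lose) for s in pi)

        prefers_a = eu({"R"}) > eu({"B"})                  # a over a'
        prefers_b_prime = eu({"B", "Y"}) > eu({"R", "Y"})  # b' over b
        if prefers_a and prefers_b_prime:
            found.append(blue)
    return found


compatible = rationalizing_beliefs()
print(compatible)
```

The search comes back empty: the first preference needs \(\pi(R) > \pi(B)\) and the second needs \(\pi(B) > \pi(R)\), so no single probability on the three colors supports both.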

Table 5.2 Actions available in the Ellsberg paradox. In Savage’s notation,
we have \(\Theta = \{R, B, Y\}\) and \(Z = \{0, 1000\}\). There are 300 balls in the urn
and 100 are red (R), so most decision makers would agree that \(\pi(R) = 1/3\)
while \(\pi(B)\) and \(\pi(Y)\) need to be derived from preferences.

          R        B        Y
a       1000       0        0
a'         0    1000        0
b       1000       0     1000
b'         0    1000     1000


Why is there a reversal of preferences in Ellsberg’s experience? The answer
seems to be that many decision makers prefer gambles where the odds are known to
gambles where the odds are, in Ellsberg’s words, ambiguous. Gärdenfors and Sahlin
comment:

The rationale for these preferences seems to be that there is a difference
between the quality of knowledge we have about the states. We
know that the proportion of red balls is one third, whereas we are uncertain
about the proportion of blue balls (it can be anything between zero
and two thirds). Thus this decision situation falls within the unnamed
area between decision making under “risk” and decision making under
“uncertainty.”

The difference in information about the states is then reflected in the
preferences in such a way that the alternative for which the exact probability
of winning can be determined is preferred to the alternative where
the probability of winning is “ambiguous” (Ellsberg’s term). (Gärdenfors
and Sahlin 1988, p. 12)

Ellsberg himself affirms that the apparently contradictory behavior presented in 
the previous example is not random at all: 

none of the familiar criteria for predicting or prescribing decision-making
under uncertainty corresponds to this pattern of choices. Yet the
choices themselves do not appear to be careless or random. They are
persistent, reportedly deliberate, and they seem to predominate empirically;
many of the people who take them are eminently reasonable, and
they insist that they want to behave this way, even though they may be
generally respectful of the Savage axioms....

Responses from confessed violators indicate that the difference is
not to be found in terms of the two factors commonly used to determine
a choice situation, the relative desirability of the possible pay-offs
and the relative likelihood of the events affecting them, but in a third
dimension of the problem of choice: the nature of one’s information concerning
the relative likelihood of events. What is at issue might be called
the ambiguity of this information, a quality depending on the amount,
type, reliability and “unanimity” of information, and giving rise to one’s
degree of “confidence” in an estimate of relative likelihoods. (Ellsberg
1961, pp. 257-258)

More readings on the Ellsberg paradox are in Fishburn (1983) and Levi (1985). 
See also Problem 5.4. 


5.5 Exercises 

Problem 5.1 Savage’s “informal” definition of the sure thing principle, from the
quote in Section 5.2.2 is: If the person would not prefer \(a_1\) to \(a_2\) either knowing that
the event \(\Theta_0\) obtained, or knowing that the event \(\Theta_0^c\) obtained, then he does not
prefer \(a_1\) to \(a_2\). Using Definition 5.1 for conditional preference, and using Savage’s
axioms directly (and not the representation theorem), show that Savage’s axioms imply
the informal sure thing principle above; that is, show that the following is true: if
\(a \succeq a'\) on \(\Theta_0\) and \(a \succeq a'\) on \(\Theta_0^c\), then \(a \succeq a'\). Optional question: what if you
replace \(\Theta_0\) and \(\Theta_0^c\) with \(\Theta_0\) and \(\Theta_1\), where \(\Theta_0\) and \(\Theta_1\) are mutually exclusive?
Highly optional question for the hopelessly bored: what if \(\Theta_0\) and \(\Theta_1\) are not
mutually exclusive?

Problem 5.2 Prove Theorem 5.1. 

Problem 5.3 One thing we find unconvincing about the Allais paradox, at least 
as originally stated, is that it refers to a situation that is completely hypothetical for 
the subject making the decisions. The sums are huge, and subjects never really get 
anything, or at most they get $10 an hour if the guy who runs the study had a decent 
grant. But maybe you can think of a real-life situation where you expect that people 
may violate the axioms in the same way as in the Allais paradox. 

We are clearly not looking for proofs here—just a scenario, but one that gets 
higher marks than Allais’ for realism. If you can add even the faintest hint that real 
individuals may violate expected utility, you will have a home run. There is a huge 
bibliography on this, so one fair way to solve this problem is to become an expert in 
the field, although that is not quite the spirit of this exercise. 

Problem 5.4 (DeGroot 1970) Consider two boxes, each of which contains both
red balls and green balls. It is known that one-half of the balls in box 1 are red
and the other half are green. In box 2, the proportion \(\theta\) of red balls is not known
with certainty, but this proportion has a probability distribution over the interval
\(0 \le \theta \le 1\).

(a) Suppose that a person is to select a ball at random from either box 1 or box
2. If that ball is red, the person wins $1; if it is green the person wins nothing. Show
that under any utility function which is an increasing function of monetary gain, the
person should prefer selecting the ball from box 1 if, and only if, \(E_\theta[\theta] < 1/2\).

(b) Suppose that a person can select \(n\) balls (\(n \ge 2\)) at random from either of
the boxes, but that all \(n\) balls must be selected from the same box; suppose that each
selected ball will be put back in the box before the next ball is selected; and suppose
that the person will receive $1 for each red ball selected and nothing for each green
ball. Also, suppose that the person’s utility function \(u\) of monetary gain is strictly
concave over the interval \([0, n]\), and suppose that \(E_\theta[\theta] = 1/2\). Show that the person
should prefer to select the balls from box 1.

Hint: Show that, if the balls are selected from box 2, then for any given value of
\(\theta\), \(E_\theta[u]\) is a concave function of \(\theta\) on the interval \(0 \le \theta \le 1\). This can be done by
showing that

\[
\frac{d^2}{d\theta^2} E_\theta[u] = n(n-1) \sum_{i=0}^{n-2} \left[u(i) - 2u(i+1) + u(i+2)\right] \binom{n-2}{i} \theta^i (1-\theta)^{n-2-i} < 0.
\]

Then apply Jensen’s inequality to \(g(\theta) = E_\theta[u]\).

(c) Switch red and green and try (b) again.

Note: The function \(f\) is concave if \(f(\alpha x + (1-\alpha)y) \ge \alpha f(x) + (1-\alpha)f(y)\).
Pictorially, concave functions are \(\cap\)-shaped. Twice differentiable concave functions
have second derivatives \(\le 0\). Jensen’s inequality states that if \(g\) is a convex (concave)
function, and \(\theta\) is a random variable, then \(E[g(\theta)] \ge (\le)\ g(E[\theta])\).

Comment: This is a very nice example, and is related to the Ellsberg paradox. The
idea there is that in the situation of case (a), according to the expected utility
principle, if \(E_\theta[\theta] = 1/2\) you should be indifferent between the two boxes. But empirically,
most people still prefer box 1. What do you think?
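
A numerical sanity check of the claim in part (b), not a substitute for the proof, can be sketched as follows. We assume \(n = 5\), the strictly concave utility \(u(x) = \sqrt{x}\), and a hypothetical two-point distribution for \(\theta\) with mean 1/2; none of these choices come from the problem statement.

```python
from math import comb, sqrt


def expected_utility_given_theta(theta, n, u):
    """E_theta[u]: expected utility of n draws with replacement when the
    proportion of red balls is theta (number of reds is binomial)."""
    return sum(
        u(i) * comb(n, i) * theta**i * (1 - theta) ** (n - i) for i in range(n + 1)
    )


n = 5
u = sqrt  # assumed strictly concave utility of monetary gain

# Box 1: the proportion of red balls is known to be 1/2.
box1 = expected_utility_given_theta(0.5, n, u)

# Box 2: hypothetical distribution on theta with E[theta] = 1/2.
prior = {0.2: 0.5, 0.8: 0.5}
prior_mean = sum(theta * w for theta, w in prior.items())
box2 = sum(w * expected_utility_given_theta(theta, n, u) for theta, w in prior.items())

print(prior_mean, box1, box2)
```

Under these assumptions box 1 has the larger expected utility, as Jensen's inequality applied to the concave function \(g(\theta) = E_\theta[u]\) predicts.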

Problem 5.5 Suppose you are to choose among two experiments A and B. You are 
interested in estimating a parameter based on squared error loss using the data from 
the experiment. Therefore you want to choose the experiment that minimizes the 
posterior variance of the parameter. If the posterior variance depends on the data that 
you will collect, expected utility theory recommends that you take an expectation. 
Suppose experiment A has an expected posterior variance of 1 and B of 1.03, so you 
prefer A. But what if you looked at the whole distribution of variances, for different 
possible data sets, and discovered that the variances you get from A range from 0.5 
to 94, while the variances you get from B range from 1 to 1.06? (The distribution for 
experiment A would have to be quite skewed, but that is not unusual.) 

Is A really a good choice? Maybe not (if 94 is not big enough, make it 94 000). 
But then, what is it that went wrong with our choice criterion? Is there a fault in the 
expected utility paradigm, or did we simply misspecify the losses and leave out an 
aspect of the problem that we really cared about? Write a few paragraphs explaining
what you would do, why, and whether there may be a connection between this
example and any of the so-called paradoxes we discussed in this chapter.


State independence


Savage’s theory provides an axiomatization that yields unique probabilities and
utilities, and provides a foundation for Bayesian decision theory. A key element in
this enterprise is the definition of constant outcomes: that is, outcomes that have the
same value to the decision maker irrespective of the states of the world. Go back to
the Samarkand story of Chapter 4: you are about to ship home a valuable rug you just
bought for $9000. At which price would you be indifferent between buying shipping
insurance for the full value of the rug, or taking the risk? Compared to where we
were in Chapter 4, we can now answer this question with or without an externally
given probability for the rug to be lost. Implicitly, however, all the approaches we
have available so far assume that the only relevant state of the world for this decision
is whether the rug will be lost. For example, we assume we can consider the value
of the sum of money paid for the rug, or the value of the sum necessary to buy
insurance, as fixed quantities that do not change with the state of the world. But what if the
value of the rug to us depended on how much we can resell it for in New York? Or
what if we had to pay for the insurance in the currency of Uzbekistan? Both of these
considerations may introduce additional elements of uncertainty that make it hard to
know the utility of the relevant outcomes without specifying the states of a “bigger
world” that includes the exchange rates and the sale of the rug in addition to the
shipment outcome. Our solutions only apply to the “small world” whose only states are
whether the rug will be lost or not. They are good to the extent that the small world is
a good approximation of bigger worlds. Savage (1954, Sec. 5.5) thought extensively
about this issue and included in his book a very insightful section, pertinently called
“Small worlds.” Elsewhere, he comments:


Informally, or extraformally, the consequences are conceived of as what
the person experiences and what he has preferences for even when there
is no uncertainty. This idea of pure experience, as a good or ill, seems
philosophically suspect and is certainly impractical. In applications the
role of consequence is played by such things as a cash payment or a
day’s fishing of which the “real consequences” may be very uncertain but
which are nonetheless adapted to the role of sure consequences within
the context of the specific application. (Savage 1981a, p. 306)

To elaborate on this point, in this chapter we move away from Savage’s theory
and consider another formulation of subjective expected utility theory, developed
by Anscombe and Aumann (1963). Though this theory is somewhat less general
than Savage’s, here too we have both personal utilities on outcomes and
personal probabilities on unknown states of the world. A real nugget in this theory
is the clear answer given to the question of independence of utilities and
states of the world. We outline this theory in full, and present an example, due
to Schervish et al. (1990), that illustrates the difficulties in the definition of small
worlds.

Featured articles: 

Anscombe, F. J. & Aumann, R. J. (1963). A definition of subjective probability, 
Annals of Mathematical Statistics 34: 199-205. 

Useful background readings are Kreps (1988) and Fishburn (1970).


6.1 Horse lotteries 

We start by taking a finite set \(\Theta\) to be the set of all possible states of the world.
This will simplify the mathematics and help us home in on the issue of state
independence more cleanly. For simplicity of notation we will label states with integers,
so that \(\Theta = \{1, 2, \ldots, k\}\). Also, \(Z\) will be the set of prizes or rewards. We will
assume that \(Z\) is finite, which will enable us to use results from the NM theory.
The twist in this theory comes with the definition of acts. We start with simple
acts:

Definition 6.1 (Simple act) A simple act is a function \(a : \Theta \to Z\).

So far so good: these are defined as in Savage. Although we are primarily
interested in simple acts, we are going to build the theory in terms of more
complicated things called horse lotteries. The reason is that in this way we can
exploit the machinery of the NM theory to do most of the work in the
representation theorems. Let us suppose that we have a randomization device that lets
us define objective lotteries like in the NM theory. Let \(P\) be the set of
probability functions on \(Z\). Anscombe and Aumann (1963) consider acts to be any
function from states of the world to one of these probability distributions. They
termed these acts “horse lotteries” to suggest that you may get one of \(k\) von
Neumann and Morgenstern lotteries depending on which horse (\(\theta\)) wins a race.
Formally:

Definition 6.2 (Act, or horse lottery) An act is a function \(a : \Theta \to P\).

Then every \(a \in A\) can be written as a list of functions:

\[
a = (a(1), a(2), \ldots, a(k))
\]

and \(a(\theta) \in P\), \(\theta = 1, \ldots, k\). We also use the notation \(a(\theta, z)\) to denote the
probability that lottery \(a(\theta)\) assigns to outcome \(z\).

This leads to a very clean system of axioms and proofs but it is a little bit artificial 
and it requires the notion of an objective randomization. In this regard, Anscombe 
and Aumann write: 


anyone who wishes to avoid a concept of physical chance distinct from
probability may reinterpret our construction as a method of defining
more difficult probabilities in terms of easier ones. Such a person may
consider that probabilities may be assigned directly to the outcome
of spins of a roulette wheel, flips of a coin, and suchlike from
considerations of symmetry. The probabilities may be so widely agreed
on as to be termed impersonal or objective probabilities. Then with
some assumptions concerning independence, our construction can be
used to define subjective probabilities for other sorts of outcomes
in terms of these objective probabilities. (Anscombe and Aumann
1963, p. 204)


The theory, as in NM theory, will require compound acts. These are defined as 
earlier, state by state. Formally: 

Definition 6.3 (Compound acts) For \(a\) and \(a'\) from \(A\) and for \(\alpha \in [0, 1]\), define
the act \(a'' = \alpha a + (1 - \alpha)a'\) by

\[
a''(\theta) = \alpha a(\theta) + (1 - \alpha)a'(\theta) \quad \forall \theta \in \Theta.
\]

Figure 6.1 illustrates this notation. In the figure, \(\Theta = \{1, 2\}\) and \(Z = \{10, 15, 20, 25, 30\}\).
So, \(a(1) = (0.5, 0.3, 0.2, 0.0, 0.0)\), while \(a'(2) = (0.0, 0.0, 0.6, 0.0, 0.4)\). If
we define a compound act of the form \(a'' = 0.6a + 0.4a'\), we have
\(a''(1) = (0.3, 0.18, 0.32, 0.2, 0.0)\) and \(a''(2) = (0.0, 0.0, 0.84, 0.0, 0.16)\).

Figure 6.1 Two Anscombe-Aumann actions \(a\) and \(a'\) and their compound action
with \(\alpha = 0.6\). Here 1 and 2 are the two possible states of the world.
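
The state-by-state mixing of Definition 6.3 is easy to reproduce. In the sketch below, \(a(1)\) and \(a'(2)\) are taken from the text, while \(a(2)\) and \(a'(1)\) are not stated there: we backed them out from the reported values of \(a''(1)\) and \(a''(2)\), so they should be read as inferred assumptions. The function name is ours.

```python
def compound_act(alpha, act, other):
    """State-by-state mixture alpha * act + (1 - alpha) * other of two horse lotteries.

    Each act maps a state to a tuple of probabilities over the same ordered prizes.
    """
    return {
        theta: tuple(
            alpha * p + (1 - alpha) * q for p, q in zip(act[theta], other[theta])
        )
        for theta in act
    }


prizes = (10, 15, 20, 25, 30)

a = {
    1: (0.5, 0.3, 0.2, 0.0, 0.0),  # given in the text
    2: (0.0, 0.0, 1.0, 0.0, 0.0),  # inferred from a''(2), not stated in the text
}
a_prime = {
    1: (0.0, 0.0, 0.5, 0.5, 0.0),  # inferred from a''(1), not stated in the text
    2: (0.0, 0.0, 0.6, 0.0, 0.4),  # given in the text
}

mix = compound_act(0.6, a, a_prime)
for theta in sorted(mix):
    print(theta, [round(p, 2) for p in mix[theta]])
```

Each component of the compound act is again a probability distribution on \(Z\), so the compound act is itself a horse lottery.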


6.2 State-dependent utilities 

Now we are able to introduce a set of axioms with respect to preferences (that is, \(\succ\))
among elements of \(A\).

Axiom AA1 \(\succ\) on \(A\) is a preference relation.

Axiom AA2 If \(a \succ a'\) and \(\alpha \in (0, 1]\), then

\[
\alpha a + (1 - \alpha)a'' \succ \alpha a' + (1 - \alpha)a''
\]

for every \(a'' \in A\).

Axiom AA3 If \(a \succ a' \succ a''\), then \(\exists\, \alpha, \beta \in (0, 1)\) such that

\[
\alpha a + (1 - \alpha)a'' \succ a' \succ \beta a + (1 - \beta)a''.
\]

These axioms are the same as NM1, NM2, and NM3 except that they apply to the
more complicated acts that we are considering here.

Based on results from the NM representation theorem, there is a function
\(f : A \to \Re\) that represents \(\succ\) and satisfies

\[
f(\alpha a + (1 - \alpha)a') = \alpha f(a) + (1 - \alpha)f(a').
\]

Using this fact and a little more work (Kreps 1988), one can establish the following
result.

Theorem 6.1 Axioms AA1, AA2, and AA3 are necessary and sufficient for the
existence of real-valued \(u_1, \ldots, u_k\), such that

\[
a \succ a' \iff \sum_\theta \sum_z u_\theta(z)a(\theta, z) > \sum_\theta \sum_z u_\theta(z)a'(\theta, z).
\]

Also, if \(u'_1, \ldots, u'_k\) is another collection of functions satisfying such a condition, then
\(\exists\, \alpha > 0\) and \(\beta_\theta\), \(\theta = 1, \ldots, k\), such that \(\alpha u_\theta + \beta_\theta = u'_\theta\).


This theorem is great progress, but we are still far from the standard expected
utility representation. The reason is that the functions \(u\) depend on both \(\theta\) and \(z\). These
are usually called state-dependent utilities (you have a different utility function in
every state of the world) and they are a mix of what we usually call utility (the
state-independent ones) and what we call probability. Specifically, if we allow utilities to
differ across states, the uniqueness of personal probability no longer holds, and there
are multiple probability distributions that satisfy the representation of Theorem 6.1.
For example, the representation of Theorem 6.1 can be interpreted to mean that \(k u_\theta(z)\)
is the (state-dependent) utility and \(\pi(\theta) = 1/k\) are the personal probabilities.
Alternatively, for any other probability distribution \(\pi^*\) such that \(\pi^*(\theta) > 0\) for every state,
we can define

\[
u^*_\theta(z) = \frac{u_\theta(z)}{\pi^*(\theta)}
\]

and still conclude that \(a \succ a'\) if and only if

\[
\sum_\theta \sum_z \pi^*(\theta) u^*_\theta(z) a(\theta, z) > \sum_\theta \sum_z \pi^*(\theta) u^*_\theta(z) a'(\theta, z).
\]

We can therefore represent the given preferences with multiple combinations of
probabilities and state-dependent utilities.
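
The lack of uniqueness is easy to see numerically. The sketch below uses made-up state-dependent utilities on two states and two prizes, rescales them by an arbitrary positive probability vector \(\pi^*\), and confirms that the probability-weighted score of each act is unchanged. All numbers and names are hypothetical.

```python
from fractions import Fraction


def score(act, weights, utils):
    """sum_theta weights[theta] * sum_z utils[theta][z] * act[theta][z]."""
    return sum(
        weights[t] * sum(utils[t][z] * p for z, p in act[t].items()) for t in act
    )


def rescale(utils, pi_star):
    """State-dependent utilities u*_theta(z) = u_theta(z) / pi*(theta)."""
    return {t: {z: val / pi_star[t] for z, val in utils[t].items()} for t in utils}


# Hypothetical state-dependent utilities u_theta(z).
u = {
    1: {"lo": Fraction(0), "hi": Fraction(3)},
    2: {"lo": Fraction(1), "hi": Fraction(2)},
}
ones = {1: 1, 2: 1}  # Theorem 6.1 form: no explicit probability weights

# An arbitrary probability distribution with positive mass on every state.
pi_star = {1: Fraction(1, 4), 2: Fraction(3, 4)}
u_star = rescale(u, pi_star)

# Two hypothetical horse lotteries.
a = {
    1: {"lo": Fraction(1, 2), "hi": Fraction(1, 2)},
    2: {"lo": Fraction(1), "hi": Fraction(0)},
}
a_prime = {
    1: {"lo": Fraction(1), "hi": Fraction(0)},
    2: {"lo": Fraction(0), "hi": Fraction(1)},
}

for name, act in (("a", a), ("a'", a_prime)):
    print(name, score(act, ones, u), score(act, pi_star, u_star))
```

Any other strictly positive \(\pi^*\) would give the same scores, which is why preferences satisfying only AA1-AA3 cannot pin down a personal probability.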


6.3 State-independent utilities 

The additional condition that is needed to disentangle the probability from the
utility is state independence. Specifically, we need one more definition and two more
axioms. A state \(\theta\) is called null if we no longer care about the outcome of the
lottery once \(\theta\) has occurred. This translates into the condition that we are indifferent
between any two acts that differ only in what happens if \(\theta\) occurs. Formally we have:

Definition 6.4 (Null state) The state \(\theta\) is said to be null if \(a \sim a'\) for all pairs \(a\)
and \(a'\) such that \(a(\theta') = a'(\theta')\) for all \(\theta' \neq \theta\).

Here are the two new axioms. AA4 is just a structural condition to avoid wasting
time with really boring problems where one is indifferent to everything. AA5 is a
more serious weapon.

Axiom AA4 There exist \(a\) and \(a'\) in \(A\) such that \(a' \succ a\).

Axiom AA5 Take any \(a \in A\) and any two probability distributions \(p\) and \(q\) on \(Z\). If

\[
(a(1), \ldots, a(\theta - 1), p, a(\theta + 1), \ldots, a(k)) \succ
(a(1), \ldots, a(\theta - 1), q, a(\theta + 1), \ldots, a(k))
\]

for some state \(\theta\), then for all non-null \(\theta'\)

\[
(a(1), \ldots, a(\theta' - 1), p, a(\theta' + 1), \ldots, a(k)) \succ
(a(1), \ldots, a(\theta' - 1), q, a(\theta' + 1), \ldots, a(k)).
\]

AA5 is a monotonicity axiom which asks us to consider two comparisons: in the
first, the two actions are identical except that in state \(\theta\) one has lottery \(p\) and the
other \(q\). In the second, the two actions are again identical, except that now it is in
state \(\theta'\) that one has lottery \(p\) and the other \(q\). Suppose that one prefers the action
with lottery \(p\) in the first comparison. The axiom requires that the preference will
hold in the second comparison as well. So the preference for \(p\) over \(q\) is independent
of the state. So Axiom AA5 says that preferences should be state independent.
Along with AA1-AA4, this condition will provide a representation theorem, discussed
in the next section, with unique probabilities and unique utilities up to an
affine transformation.

Before we move to the details of this representation, consider this example,
reported by Schervish et al. (1990), that shows what could go wrong with AA5.
Suppose an agent, who expresses preferences according to Anscombe-Aumann’s axioms
and has linear utility for money (that is, \(u(cz) = cu(z)\)), is offered to choose among
three simple acts \(a_1, a_2, a_3\) whose payoffs are described in Table 6.1 depending on
the states of nature \(\theta_1, \theta_2, \theta_3\). Assuming that the agent has state-independent utility,
the expected utility of lottery \(a_i\) is \(u(1)\pi(\theta_i)\), \(i = 1, 2, 3\). Furthermore, if all three
horse lotteries are equivalent for the agent, then it must be that he or she considers
the states to be equally likely, that is \(\pi(\theta_i) = 1/3\), \(i = 1, 2, 3\).


Table 6.1 Payoffs for six horse lotteries in the dollar/yen example.

         \(\theta_1\)     \(\theta_2\)     \(\theta_3\)
a_1      $1       0        0
a_2      0        $1       0
a_3      0        0        $1
a_4      ¥100     0        0
a_5      0        ¥125     0
a_6      0        0        ¥150


Now imagine that the agent is also indifferent among lotteries \(a_4, a_5, a_6\) of
Table 6.1. Again assuming state-independent linear utility for yen payoffs and
assuming that for the agent the lotteries are equivalent, we conclude that
\(\pi^*(\theta_1)u(100) = \pi^*(\theta_2)u(125) = \pi^*(\theta_3)u(150)\), or
\(\pi^*(\theta_1) = 1.25\,\pi^*(\theta_2) = 1.5\,\pi^*(\theta_3)\), which implies
\(\pi^*(\theta_1) = 0.4054\), \(\pi^*(\theta_2) = 0.3243\), and \(\pi^*(\theta_3) = 0.2703\).

So this is in seeming contradiction with the indifference among \(a_1, a_2, a_3\).
However, the agent is not necessarily inconsistent, but may simply not be willing to follow
axiom AA5. For example, the states could represent exchange rates between dollars
and yen (\(\theta_1\) = {1 dollar is worth 100 yen}, \(\theta_2\) = {1 dollar is worth 125 yen}, and
\(\theta_3\) = {1 dollar is worth 150 yen}). Then AA5 would be untenable. The reward set is
Z = {$1, ¥100, ¥125, ¥150}. In order to make comparisons among the rewards in
Z you must know the state of nature.
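
The two sets of probabilities in this example follow from a one-line calculation, sketched below. Under state-independent linear utility, indifference among bets that pay \(c_i\) in state \(\theta_i\) and nothing otherwise requires \(\pi(\theta_i)c_i\) to be constant, so \(\pi(\theta_i)\) is proportional to \(1/c_i\). The function name is ours.

```python
from fractions import Fraction


def implied_probabilities(payoffs):
    """Probabilities that make the bets 'payoffs[i] in state i, 0 elsewhere'
    equally attractive under state-independent linear utility."""
    weights = [Fraction(1, c) for c in payoffs]
    total = sum(weights)
    return [w / total for w in weights]


# Indifference among a_1, a_2, a_3 (one dollar in each state).
dollar_probs = implied_probabilities([1, 1, 1])

# Indifference among a_4, a_5, a_6 (100, 125, and 150 yen).
yen_probs = implied_probabilities([100, 125, 150])

print([str(p) for p in dollar_probs])
print([round(float(p), 4) for p in yen_probs])
```

The same indifference pattern yields two different "personal" probabilities depending on which currency is treated as having constant utility across states.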

You can see that if the agent is (and in this case for a good reason) not ready to
endorse AA5, we are left with the question of which is the agent’s “personal”
probability? The uniqueness of the probability depends on the choice of what counts as a
“constant” of utility: the dollar or the yen. Schervish et al. (1990), who concocted
this example, have an extensive discussion of related issues. They also make an
interesting comment on the implications of AA5 for statistical decision theory and all the
good stuff in Chapter 7:

Much of statistical decision theory makes use of utility functions of the
form \(u(a(\theta))\), where \(\theta\) is a state of nature and \(a\) is a possible decision.
The prize awarded when decision \(a\) is chosen and the state of nature is
\(\theta\) is not explicitly mentioned. Rather, the utility of this prize is specified
without reference to the prize. Although it would appear that \(u(a(\theta))\) is a
state dependent utility (as well it might be), one has swept comparisons
between states “under the rug”. For example, if \(u(a(\theta)) = -(\theta - a)^2\),
one might ask how it was determined that an error of 1 when \(\theta = \theta_1\) has
the same utility of an error of 1 when \(\theta = \theta_2\). (Schervish et al. 1990,
pp. 846-847 with notational changes)


6.4 Anscombe-Aumann representation theorem 

We are now ready to discuss the Anscombe-Aumann representation theorem
and its proof.

Theorem 6.2 Axioms AA1-AA5 are necessary and sufficient for the existence of a
nonconstant function \(u : Z \to \Re\) and a probability distribution \(\pi\) on \(\Theta\) such that

\[
a \succ a' \iff \sum_\theta \pi(\theta) \sum_z u(z)a(\theta, z) > \sum_\theta \pi(\theta) \sum_z u(z)a'(\theta, z).
\]

Moreover, the probability distribution \(\pi\) is unique, and \(u\) is unique up to a positive
linear transformation.

This sounds like magic. Where did the probability come from? The proof sheds
some light on this. Look out for a revealing point when personal probabilities
materialize from the weights in the affine transformations that define the class of solutions
to the NM theorem. Our proof follows Kreps (1988).

Proof: The proof in the \(\Rightarrow\) direction is the most interesting. It starts by
observing that axioms AA1-AA3 imply a representation as given by Theorem 6.1. Axiom
AA4 implies that there is at least one nonnull state, say \(\theta_0\). For any two probability
distributions \(p\) and \(q\) on \(Z\), any nonnull state \(\theta\) and an arbitrary \(a\), we have that

\[
\sum_z u_\theta(z)p(z) > \sum_z u_\theta(z)q(z)
\]

if and only if

\[
(a(1), \ldots, a(\theta - 1), p, a(\theta + 1), \ldots, a(k)) \succ
(a(1), \ldots, a(\theta - 1), q, a(\theta + 1), \ldots, a(k))
\]

if and only if (by application of Axiom AA5)

\[
(a(1), \ldots, a(\theta_0 - 1), p, a(\theta_0 + 1), \ldots, a(k)) \succ
(a(1), \ldots, a(\theta_0 - 1), q, a(\theta_0 + 1), \ldots, a(k))
\]

if and only if (by the representation in Theorem 6.1)

\[
\sum_z u_{\theta_0}(z)p(z) > \sum_z u_{\theta_0}(z)q(z).
\]

For simple lotteries, the result on uniqueness in NM theory guarantees that, for each
nonnull state \(\theta\), there are constants \(\alpha_\theta > 0\) and \(\beta_\theta\) such that

\[
\alpha_\theta u_{\theta_0}(\cdot) + \beta_\theta = u_\theta(\cdot).
\]

In particular, for null states, \(\alpha_\theta = 0\) (because a state \(\theta\) is null if and only if \(u_\theta\) is
constant). We can define \(u(z) = u_{\theta_0}(z)\) (and \(\alpha_{\theta_0} = 1\), \(\beta_{\theta_0} = 0\)) which, along with the
representation obtained with Theorem 6.1, implies

\[
a \succ a' \iff \sum_\theta \sum_z (\alpha_\theta u(z) + \beta_\theta)a(\theta, z) > \sum_\theta \sum_z (\alpha_\theta u(z) + \beta_\theta)a'(\theta, z)
\]

or, equivalently, since \(\sum_z a(\theta, z) = 1\) for every \(\theta\),

\[
\sum_\theta \beta_\theta + \sum_\theta \alpha_\theta \sum_z u(z)a(\theta, z) > \sum_\theta \beta_\theta + \sum_\theta \alpha_\theta \sum_z u(z)a'(\theta, z).
\]

Subtracting the sum of \(\beta_\theta\) and dividing the remaining elements on both sides
of the inequality by the positive quantity \(\sum_\theta \alpha_\theta\) completes the proof if we take
\(\pi(\theta) = \alpha_\theta / \sum_{\theta'} \alpha_{\theta'}\).

The proof of the uniqueness part of the theorem is left as an exercise. The proof
that axioms AA1-AA5 are necessary for the representation in Theorem 6.2 is a
worked exercise.

Anscombe-Aumann’s representation theorem assumes that the set of possible
states of the world \(\Theta\) is finite. An extension of this representation theorem to arbitrary
\(\Theta\) is provided by Fishburn (1970). □
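
The step of the proof in which probabilities "materialize" can be traced on a toy example. In the sketch below the common utility \(u\), the weights \(\alpha_\theta\), and the shifts \(\beta_\theta\) are hypothetical numbers (state 3 is null, so its weight is 0); the code builds the state-dependent utilities \(u_\theta = \alpha_\theta u + \beta_\theta\), normalizes the weights into \(\pi\), and checks that the Theorem 6.1 score of an act equals \(\sum_\theta \beta_\theta + (\sum_\theta \alpha_\theta)\) times its expected utility under \(\pi\).

```python
from fractions import Fraction


def probabilities_from_weights(alpha):
    """pi(theta) = alpha_theta / sum of all alpha, as in the last step of the proof."""
    total = sum(alpha.values())
    return {theta: w / total for theta, w in alpha.items()}


# Hypothetical state-independent utility on two prizes.
u = {"lo": Fraction(0), "hi": Fraction(1)}
# Hypothetical affine coefficients; state 3 is null, so its weight is 0.
alpha = {1: Fraction(1), 2: Fraction(3), 3: Fraction(0)}
beta = {1: Fraction(0), 2: Fraction(5), 3: Fraction(-2)}

# State-dependent utilities u_theta(z) = alpha_theta * u(z) + beta_theta.
u_state = {t: {z: alpha[t] * u[z] + beta[t] for z in u} for t in alpha}
pi = probabilities_from_weights(alpha)


def state_dependent_value(act):
    """Score of an act in the representation of Theorem 6.1."""
    return sum(u_state[t][z] * p for t in act for z, p in act[t].items())


def expected_utility(act):
    """Score of an act in the representation of Theorem 6.2."""
    return sum(pi[t] * sum(u[z] * p for z, p in act[t].items()) for t in act)


# A hypothetical horse lottery.
act = {
    1: {"lo": Fraction(1, 2), "hi": Fraction(1, 2)},
    2: {"lo": Fraction(1, 4), "hi": Fraction(3, 4)},
    3: {"lo": Fraction(1), "hi": Fraction(0)},
}

print(pi, state_dependent_value(act), expected_utility(act))
```

Because the two scores differ only by a positive affine map that is the same for every act, they rank acts identically, and the null state ends up with probability zero.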

6.5 Exercises 

Problem 6.1 (Schervish 1995, problem 31, p. 212) Suppose that there are \(k \ge 2\)
horses in a race and that a gambler believes that \(\pi_i\) is the probability that horse \(i\) will
win (\(\sum_{i=1}^k \pi_i = 1\)). Suppose that the gambler has decided to wager an amount \(a\) to be
divided among these \(k\) horses. If he or she wagers \(a_i\) on horse \(i\) and that horse wins,
the utility of the gambler is \(\log(c_i a_i)\), where \(c_1, \ldots, c_k\) are known positive numbers.
Find values \(a_1, \ldots, a_k\) that maximize the expected utility.

Problem 6.2 Prove the uniqueness of \(\pi\) and \(u\) in Theorem 6.2.

Problem 6.3 Prove that if there is a function \(u : Z \to \Re\), and a probability
distribution \(\pi\) on \(\Theta\) such that

\[
a \succ a' \iff \sum_\theta \pi(\theta) \sum_z u(z)a(\theta, z) > \sum_\theta \pi(\theta) \sum_z u(z)a'(\theta, z) \qquad (6.1)
\]

then \(\succ\) satisfies axioms AA1-AA3 and AA5, assuming AA4.

Solution 

We do this one axiom at a time.

Axiom AA1 \(\succ\) on \(A\) is a preference relation.

We know that \(\succ\) is a preference relation if \(\forall\, a, a' \in A\), (i) \(a \succ a'\), \(a' \succ a\), or \(a \sim a'\),
and (ii) if \(a \succ a'\) and \(a' \succ a''\), then \(a \succ a''\). We are going to call \(U\), \(U'\), and so on the
expected utilities associated with actions \(a\), \(a'\), and so on. For example,

\[
U' = \sum_\theta \pi(\theta) \sum_z u(z)a'(\theta, z).
\]

(i) Since the \(U\) are real numbers, \(U > U'\), \(U' > U\), or \(U = U'\), and from (6.1)
\(a \succ a'\), \(a' \succ a\), or \(a \sim a'\).

(ii) Since \(a \succ a' \Rightarrow U > U'\) and \(a' \succ a'' \Rightarrow U' > U''\). Consequently, \(U > U''\),
by transitivity on \(\Re\), and so \(a \succ a''\) by (6.1).

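Axiom AA2 If \(a \succ a'\) and \(\alpha \in (0, 1]\), then
\(\alpha a + (1 - \alpha)a'' \succ \alpha a' + (1 - \alpha)a''\), for every \(a'' \in A\).

If \(a \succ a'\) then

\[
\alpha \sum_\theta \pi(\theta) \sum_z u(z)a(\theta, z) > \alpha \sum_\theta \pi(\theta) \sum_z u(z)a'(\theta, z)
\]

for all \(\alpha \in (0, 1]\). Therefore,

\[
\begin{aligned}
&\alpha \sum_\theta \pi(\theta) \sum_z u(z)a(\theta, z) + (1 - \alpha) \sum_\theta \pi(\theta) \sum_z u(z)a''(\theta, z) > \\
&\alpha \sum_\theta \pi(\theta) \sum_z u(z)a'(\theta, z) + (1 - \alpha) \sum_\theta \pi(\theta) \sum_z u(z)a''(\theta, z)
\end{aligned}
\]

or

\[
\begin{aligned}
&\sum_\theta \pi(\theta) \sum_z u(z)\left[\alpha a(\theta, z) + (1 - \alpha)a''(\theta, z)\right] > \\
&\sum_\theta \pi(\theta) \sum_z u(z)\left[\alpha a'(\theta, z) + (1 - \alpha)a''(\theta, z)\right]
\end{aligned}
\]

and it follows from (6.1) that

\[
\alpha a + (1 - \alpha)a'' \succ \alpha a' + (1 - \alpha)a''.
\]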

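Axiom AA3 If $a \succ a' \succ a''$, then $\exists\, \alpha, \beta \in (0,1)$ such that
$\alpha a + (1 - \alpha) a'' \succ a' \succ \beta a + (1 - \beta) a''$.

From (6.1), if $a \succ a' \succ a''$,

$$U = \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a(\theta, z) > U' = \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a'(\theta, z) > U'' = \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a''(\theta, z)$$

where $U, U', U'' \in \Re$. By continuity on $\Re$, there exist $\alpha, \beta \in (0,1)$ such that

$$\beta U + (1 - \beta) U'' < U' < \alpha U + (1 - \alpha) U''$$

or

$$\sum_{\theta} \pi(\theta) \sum_{z} u(z) \left[\beta a(\theta, z) + (1 - \beta) a''(\theta, z)\right]
< \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a'(\theta, z)
< \sum_{\theta} \pi(\theta) \sum_{z} u(z) \left[\alpha a(\theta, z) + (1 - \alpha) a''(\theta, z)\right]$$

and then $\alpha a + (1 - \alpha) a'' \succ a' \succ \beta a + (1 - \beta) a''$.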

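Axiom AA5 If $a \in \mathcal{A}$ and $p, q \in \mathcal{P}$ are such that

$$(a(1), \ldots, a(i-1), p, a(i+1), \ldots, a(k)) \succ (a(1), \ldots, a(i-1), q, a(i+1), \ldots, a(k))$$

for some $i$, then for all nonnull $j$

$$(a(1), \ldots, a(j-1), p, a(j+1), \ldots, a(k)) \succ (a(1), \ldots, a(j-1), q, a(j+1), \ldots, a(k)).$$

Redefine

$$a = (a(1), \ldots, a(i-1), p, a(i+1), \ldots, a(k))$$
$$a' = (a(1), \ldots, a(i-1), q, a(i+1), \ldots, a(k))$$
$$a'' = (a(1), \ldots, a(j-1), p, a(j+1), \ldots, a(k))$$
$$a''' = (a(1), \ldots, a(j-1), q, a(j+1), \ldots, a(k))$$

where $i$ and $j$ are nonnull states. Suppose that

$$a \succ a' \quad \text{but} \quad a''' \succ a''.$$

Then from (6.1) $a \succ a'$ if and only if

$$\sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a(\theta, z) > \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a'(\theta, z).$$

Since $a(\theta) = a'(\theta)$ for $\theta = 1, 2, \ldots, i-1, i+1, \ldots, k$, $a(i) = p$, and $a'(i) = q$ we can
see that

$$\sum_{z} u(z)\, p(z) > \sum_{z} u(z)\, q(z)$$

or $p \succ q$, in NM terminology. Analogously,

$$a''' \succ a'' \;\Rightarrow\; \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a'''(\theta, z) > \sum_{\theta} \pi(\theta) \sum_{z} u(z)\, a''(\theta, z)
\;\Rightarrow\; \sum_{z} u(z)\, q(z) > \sum_{z} u(z)\, p(z)
\;\overset{\text{NM}}{\Longrightarrow}\; q \succ p$$

which is a contradiction, since $p$ must be preferred to $q$ in any nonnull state.
Therefore, $a'' \succ a'''$. □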

Problem 6.4 (Kreps 1988, problem 5, p. 112) What happens to this theory if the
outcome space $Z$ changes with the state? That is, suppose that in state $\theta$ the possible
outcomes are given by a set $Z_\theta$. How much of the development above can you adapt
to this setting? If you know that there are at least two prizes that lie in each of the
$Z_\theta$, how much of this chapter’s development can you carry over?

Hint: Spend a finite amount of time on this and then write down your thoughts. 


Part Two

Statistical Decision Theory

Decision functions


This chapter reviews the architecture of statistical decision theory—a formal attempt 
at providing a rational foundation to the way we learn from data. Our overview 
is broad, and covers concepts developed over several decades and from different 
viewpoints. The seed of the ideas we present, as Ferguson (1976) points out, can be 
traced back to Bernoulli (1738), Laplace (1812), and Gauss (1821). During the late 
1880s, an era when the utilitarianism of Bentham and Mill had a prominent role in 
economics and social sciences, Edgeworth commented: 

the higher branch of probabilities projects into the field of Social Science.
Conversely, the Principle of Utility is at the root of even the more
objective portions of the Theory of Observations. The founders of the
Science, Gauss and Laplace, distinctly teach that, in measuring a physical
quantity, the quaesitum is not so much that value which is most
probably right, as that which may most advantageously be assigned— 
taking into account the frequency and the seriousness of the error 
incurred (in the long run of metretic operations) by the proposed method 
of reduction. (Edgeworth 1887, p. 485) 

This idea was made formal and general through the conceptual framework
known today as statistical decision theory, due essentially to Abraham Wald (Wald
1945, Wald 1949). Historical and biographical details are in Weiss (1992). In his
1949 article on statistical decision functions, a prelude to a book of the same title to
appear in 1950, Wald proposed a unifying framework for much of the existing statistical
theory, based on treating statistical inference as a special case of game theory.
A mathematical theory of games that provided the foundation for economic theory
had been proposed by von Neumann and Morgenstern in the same 1944 book that
contained the axiomatization of utility theory discussed in Chapter 3. Wald framed
statistical inference as a two-person, zero-sum game. One of the players is Nature and
the other player is the Statistician. Nature chooses the probability distribution for the
experimental evidence that will be observed by the Statistician. The Statistician, on
the other hand, observes experimental results and chooses a decision, for example
a hypothesis or a point estimate. In a zero-sum game losses to one player are gains
to the other. This leads to statistical decisions based on the minimax principle, which
we will discuss in this chapter. The minimax principle tends to provide conservative
statistical decision strategies, and is often justified this way. It is also appealing from
a formal standpoint in that it allows us to borrow the machinery of game theory to
prove an array of results on optimal decisions, and to devise approaches that do not
require a prior distribution on the unknowns. However, the intrinsically pessimistic
angle imposed by the zero-sum nature of the game has drawbacks, which we will
begin to illustrate as well.

Decision Theory: Principles and Approaches G. Parmigiani, L. Y. T. Inoue
© 2009 John Wiley & Sons, Ltd

We find it useful to distinguish two aspects of Wald’s contributions: one is the 
formal architecture of the statistical decision problem, the other is the rationality 
principle invoked to solve it. The formal architecture lends itself to statistical decision
making under the expected utility principle as well. We will define and contrast
these two approaches in Sections 7.1 and 7.2. We then illustrate their application 
by covering common inferential problems: classification and hypothesis testing in 
Section 7.5, point and interval estimation in Section 7.6.1. Lastly, in Section 7.7, we 
explore the theoretical relationship between expected utility and minimax rules and 
show how, even when using a frequentist concept of optimality, expected utility rules 
are often preferable. 

Featured article: 

Wald, A. (1949). Statistical decision functions. Annals of Mathematical Statistics 
20: 165-205. 

There are numerous excellent references on this material, including Ferguson 
(1967), Berger (1985), Schervish (1995), and Robert (1994). In places, we will make 
use of concepts and tools from basic parametric Bayesian inference, which can be 
reviewed, for example, in Berger (1985) or Gelman et al. (1995). 


7.1 Basic concepts 

7.1.1 The loss function 

Our decision maker, in this chapter the Statistician, has to choose among a set of
actions, whose consequences depend on some unknown state of the world, or state of
nature. As in previous chapters, the set of actions is called $\mathcal{A}$, and its generic member
is called $a$. The set of states of the world is called $\Theta$, with generic element $\theta$.
The basis for choosing among actions is a quantitative assessment of their consequences.
Because the consequences also depend on the unknown state of the world,
this assessment will be a function of both $a$ and $\theta$. So far, we worked with utilities
$u(a(\theta))$, attached to the outcomes of the action. Beginning with Wald, statisticians
are used to thinking about consequences in terms of the loss associated with each
pair $(\theta, a) \in (\Theta \times \mathcal{A})$ and define a loss function $L(\theta, a)$.

In Wald’s theory, and in most of statistical decision theory, the loss incurred by
choosing an action $a$ when the true state of nature is $\theta$ is relative to the losses incurred
with other actions. In one of the earliest instances in which Wald defined a loss
function (then referred to as weight function), he writes:

The weight function $L(\theta, a)$ is a real valued non-negative function
defined for all points $\theta$ of $\Theta$ and for all elements $a$ of $\mathcal{A}$, which expresses
the relative importance of the error committed by accepting $a$ when $\theta$ is
true. If $\theta$ is contained in $a$, $L(\theta, a)$ is, of course, equal to 0. (Wald 1939,
p. 302 with notational changes)

The last comment refers to a decision formulation of confidence intervals, in which
$a$ is a subset of the parameter space (see also Section 7.6.2). If $\theta$ is contained in $a$,
then the interval covers the parameter. However, the concept is general: if the “right”
decision is made for a particular $\theta$, the loss should be zero.

If a utility function is specified, we could restate utilities as losses by considering
the negative utility, and by defining the loss function directly on the space $(\theta, a) \in
(\Theta \times \mathcal{A})$, as in

$$u(a(\theta)) = -L_u(\theta, a). \qquad (7.1)$$

Incidentally, if $u$ is derived in the context of a set of axioms such as Savage’s or
Anscombe and Aumann’s, a key role in this definition is played by state independence.
In the loss function, there no longer is any explicit consideration for the
outcome $z = a(\theta)$ that determined the loss. However, it is state independence that
guarantees that losses occurring at different values of $\theta$ can be directly compared.
See also Schervish, Seidenfeld et al. (1990) and Section 6.3.

However, if one starts from a given utility function, there is no guarantee that, for
a given $\theta$, there should be an action with zero loss. This condition requires a further
transformation of the utility into what is referred to as a regret loss function $L(\theta, a)$.
This is calculated from the utility-derived loss function $L_u(\theta, a)$ as

$$L(\theta, a) = L_u(\theta, a) - \inf_{a \in \mathcal{A}} L_u(\theta, a). \qquad (7.2)$$

The regret loss function measures the inappropriateness of action $a$ under state $\theta$.
Equivalently, following Savage, we can define the regret loss function directly from
utilities as the conditional difference in utilities:

$$L(\theta, a) = \sup_{a'(\theta)} u(a'(\theta)) - u(a(\theta)). \qquad (7.3)$$

From now on we will assume, unless specifically stated, that the loss is in regret form.
Before explaining the reasons for this we need to introduce the minimax principle and
the expected utility principle in the next two sections.

7.1.2 Minimax

The minimax principle of choice in statistical decision theory is based on the analogy
with game theory, and assumes that the loss function represents the reward structure
for both the Statistician and the opponent (Nature). Nature chooses first, and so the best
strategy for the Statistician is to assume the worst and choose the action that minimizes
the maximum loss. Formally:

Definition 7.1 (Minimax action) An action $a^M$ is minimax if

$$a^M = \arg\min_{a} \max_{\theta} L(\theta, a). \qquad (7.4)$$

When necessary, we will distinguish between the minimax action obtained from the
loss function $L_u(\theta, a)$, called minimax loss action, and that obtained from $L(\theta, a)$, called
minimax regret action (Chernoff and Moses 1959, Berger 1985).

Taken literally, the minimax principle is a bit of a paranoid view of science, and
even those of us who have been struggling for years with the most frustrating scientific
problems, such as those of cancer biology, find it a poor metaphor for the
scientific enterprise. However, it is probably not the metaphor itself that accounts for
the emphasis on minimax. First, minimax does not require any knowledge about the
chance that each of the states of the world will turn out to be true. This is appealing
to statisticians seeking a rationality-based approach, but not willing to espouse subjectivist
axioms. Second, minimax statistical decisions are in many cases reasonable,
and tend to err on the conservative side.

Nonetheless, the intrinsic pessimism does create issues, some of which motivate 
Wald’s definition of the loss in regret form. In this regard, Savage notes: 

It is often said that the minimax principle is founded on ultra-pessimism,
that it demands that the actor assume the world to be in the worst possible
state. This point of view comes about because neither Wald nor other
writers have clearly introduced the concept of loss as distinguished from
negative income. But Wald does frequently say that in most, if not all,
applications $u(a(\theta))$ is never positive and it vanishes for each $\theta$ if $a$ is
chosen properly, which is the condition that $-u(a(\theta)) = L(\theta, a)$. Application
of the minimax rule to $-u(a(\theta))$ generally, instead of to $L(\theta, a)$,
is indeed ultra-pessimistic; no serious justification for it has ever been
suggested, and it can lead to the absurd conclusion in some cases that no
amount of relevant experimentation should deter the actor from behaving
as though he were in complete ignorance. (Savage 1951, p. 63 with
notational changes)

While we have not yet introduced data-based decisions, it is easy to see how this
may happen. Consider the negative utility loss in Table 7.1. Nature, mean but not
dumb, will always pick $\theta_3$, irrespective of any amount of evidence (short of a revelation
of the truth) that experimental data may provide in favor of $\theta_1$ and $\theta_2$. The regret
loss function is an improvement on this pessimistic aspect of the minimax principle.

Table 7.1 Negative utility loss function and corresponding
regret loss function. The regret loss is obtained by subtracting,
column by column, the minimum value in the column.

           Negative utility loss                  Regret loss
         $\theta_1$  $\theta_2$  $\theta_3$     $\theta_1$  $\theta_2$  $\theta_3$
$a_1$        1           0           6              0           0           1
$a_2$        3           4           5              2           4           0

For example, after the regret transformation, shown on the right of Table 7.1, Nature
has a more difficult choice to make between $\theta_2$ and $\theta_3$. We will revisit this discussion
in Section 13.4 where we address systematically the task of quantifying the
information provided by an experiment towards the solution of a particular decision
problem.
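
The regret transformation in (7.2) and its effect on the minimax action can be checked numerically on Table 7.1. The following is a minimal Python sketch, not part of the original text; the names `loss_u`, `regret`, and `minimax_action` are illustrative choices.

```python
# Negative utility loss from Table 7.1.
# Rows are actions a1, a2; columns are states theta_1, theta_2, theta_3.
loss_u = {
    "a1": [1, 0, 6],
    "a2": [3, 4, 5],
}

def regret(loss):
    """Subtract, state by state, the smallest loss over actions (equation 7.2)."""
    n_states = len(next(iter(loss.values())))
    col_min = [min(row[j] for row in loss.values()) for j in range(n_states)]
    return {a: [row[j] - col_min[j] for j in range(n_states)] for a, row in loss.items()}

def minimax_action(loss):
    """Return the action with the smallest worst-case loss, together with that loss."""
    worst = {a: max(row) for a, row in loss.items()}
    best = min(worst, key=worst.get)
    return best, worst[best]

loss_regret = regret(loss_u)
print(loss_regret)                  # regret loss, right-hand side of Table 7.1
print(minimax_action(loss_u))       # minimax loss action
print(minimax_action(loss_regret))  # minimax regret action
```

In this table the two forms of the loss lead to different minimax actions: the worst cases under the negative utility loss are 6 for $a_1$ and 5 for $a_2$, while under the regret loss they are 1 for $a_1$ and 4 for $a_2$.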

Chernoff articulates very clearly some of the most important issues with the 
regret transformation: 


First, it has never been clearly demonstrated that differences in utility do
in fact measure what one may call regret. In other words, it is not clear
that the “regret” of going from a state of utility 5 to a state of utility 3 is
equivalent in some sense to that of going from a state of utility 11 to one
of utility 9. Secondly, one may construct examples where an arbitrarily
small advantage in one state of nature outweighs a considerable advantage
in another state. Such examples tend to produce the same feelings of
uneasiness which led many to object to minimax risk. ... A third objection
which the author considers very serious is the following. In some
examples the minimax regret criterion may select a strategy $a_3$ among
the available strategies $a_1$, $a_2$, $a_3$ and $a_4$. On the other hand, if for some
reason $a_4$ is made unavailable, the minimax regret criterion will select $a_2$
among $a_1$, $a_2$ and $a_3$. The author feels that for a reasonable criterion the
presence of an undesirable strategy $a_4$ should not have an influence on the
choice among the remaining strategies. (Chernoff 1954, pp. 425-426)


Savage thought deeply about these issues, as his initial book plan was to develop 
a rational foundation of statistical inference using the minimax, not the expected 
utility principle. His initial motivation was that: 


To the best of my knowledge no objectivistic motivation of the minimax 
rule has ever been published. In particular, Wald in his works always 
frankly put the rule forward without any motivation, saying simply that 
it may appeal to some. (Savage 1954, p. 168)

He later abandoned this plan but in the same section of his book (a short and very
interesting one) he still reports some of his initial hunches as to why it may have
worked:

there are practical circumstances in which one might well be willing to
accept the rule—even one who, like myself, holds a personalistic view
of probability. It is hard to state the circumstances precisely, indeed they
seem vague almost of necessity. But, roughly, the rule tends to seem
acceptable when $\min_a \max_\theta L(\theta, a)$ is quite small compared with the values
of $L(\theta, a)$ for some acts $a$ that merit some serious consideration, and
some values of $\theta$ that do not in common sense seem nearly incredible....

It seems to me that any motivation of the minimax principle, objectivistic
or personalistic depends on the idea that decision problems with relatively
small values of $\min_a \max_\theta L(\theta, a)$ often occur in practice. ... The
cost of a particular observation typically does not depend at all on the
uses to which it is to be put, so when large issues are at stake an act
incorporating a relatively cheap observation may sometime have a relatively
small maximum loss. In particular, the income, so to speak, from
an important scientific observation may accrue copiously to all mankind
generation after generation. (Savage 1954, pp. 168-169)

7.1.3 Expected utility principle 

In contrast, the expected utility principle applies to expected losses. It requires, or if
you will, incorporates, information about how probable the various values of $\theta$ are
considered to be, and weighs the losses against their probability of occurring. As
before, these probabilities are denoted by $\pi(\theta)$. The action minimizing the resulting
expectation is called the Bayes action.

Definition 7.2 (Bayes action) An action $a^*$ is Bayes if

$$a^* = \arg\min_{a} \int_{\Theta} L(\theta, a)\, \pi(\theta)\, d\theta, \qquad (7.5)$$

where we define

$$\mathcal{L}_{\pi}(a) = \int_{\Theta} L(\theta, a)\, \pi(\theta)\, d\theta \qquad (7.6)$$

as the prior expected loss.

Formally, the difference between (7.5) and (7.4) is simply that the expectation
operator replaces the maximization operator. Both operators provide a way of
handling the indeterminacy of the state of the world.

The Bayes action is the same whether we use the negative utility or the regret
form of the loss. From expression (7.2), the two representations differ by the quantity
$\inf_{a \in \mathcal{A}} L_u(\theta, a)$; after taking an expectation with respect to $\theta$ this is a constant shift
in the prior expected loss, and has no effect on the location of the minimum. None of
the issues Chernoff was concerned about in our previous section are relevant here.
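
The invariance just described can be written out in one line. This is a sketch that assumes the infimum in (7.2) is finite and integrable with respect to the prior; it uses only symbols already defined, with $a'$ introduced here as a dummy variable for the infimum:

$$\int_{\Theta} L(\theta, a)\, \pi(\theta)\, d\theta = \int_{\Theta} L_u(\theta, a)\, \pi(\theta)\, d\theta - \int_{\Theta} \inf_{a' \in \mathcal{A}} L_u(\theta, a')\, \pi(\theta)\, d\theta.$$

The last integral does not involve $a$, so the left-hand side and the first term on the right are minimized by the same action.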

It is interesting, on the other hand, to read what Wald’s perspective was on Bayes 
actions: 

First, the objection can be made against it, as Neyman has pointed out,
that $\theta$ is merely an unknown constant and not a variate, hence it makes
no sense to speak of the probability distribution of $\theta$. Second, even if
we may assume that $\theta$ is a variate, we have in general no possibility of
determining the distribution of $\theta$ and any assumptions regarding this distribution
are of hypothetical character. ... The reason why we introduce
here a hypothetical probability distribution of $\theta$ is simply that it proves
to be useful in deducing certain theorems and in the calculation of the
best system of regions of acceptance. (Wald 1939, p. 302)

In Section 7.7 and later in Chapter 8 we will look further into what “certain
theorems” are. For now it will suffice to note that Bayes actions have been studied
extensively in decision theory from a frequentist standpoint as well, as they can
be used as technical devices to produce decision rules with desirable minimax and
frequentist properties.

7.1.4 Illustrations 

In this section we illustrate the relationship between Bayes and minimax decision 
using two simple examples. 

In the first example, a colleague is choosing a telephone company for international
calls. Thankfully, this particular problem has become almost moot since the
days we started working on this book, but you will get the point. Company A is
cheaper, but it has the drawback of failing to complete an international call $100\theta\%$
of the time. On the other hand, company B, which is a little bit more expensive, never
fails. Actions are A and B, and the unknown is $\theta$. Her loss function is as follows:

$$L(\theta, A) = 2\theta, \quad \theta \in [0, 1]$$

$$L(\theta, B) = 1.$$

Here the value of 1 represents the difference between the subscription cost of company
B and that of company A. To this your colleague adds a linear function of
the number of missed calls, implying an additional loss of 0.02 units of utility for
each percentage point of missed calls. So if she chooses company B her loss will be
known. If she chooses company A her loss will depend on the proportion of times she
will fail to make an international call. If $\theta$ were, say, 0.25 her loss would be 0.5, but if
$\theta$ were 0.55 her loss would be 1.1. The minimax action can be calculated without any
further input and is to choose company B, since

$$\sup_{\theta} L(\theta, A) = 2 > 1 = \sup_{\theta} L(\theta, B).$$

This seems a little conservative: company A would have to miss more than half the
calls for this to be the right decision. Based on a survey of consumer reports, your
colleague quantifies her prior mean for $\theta$ as 0.0476, and her prior standard deviation
as 0.1487. To keep things simple, she decides that the beta distribution is a reasonable
choice for the prior distribution on $\theta$. By matching the first and second moments of
a $\text{Beta}(\alpha_0, \beta_0)$ distribution the hyperparameters are $\alpha_0 = 0.05$ and $\beta_0 = 1.00$. Thus,
the prior probability density function is

$$\pi(\theta) = 0.05\, \theta^{-0.95}\, I_{[0,1]}(\theta)$$

and the prior expected loss is

$$\int_0^1 L(\theta, a)\, \pi(\theta)\, d\theta =
\begin{cases}
\int_0^1 2\theta\, \pi(\theta)\, d\theta = 2 E_{\theta}[\theta] & \text{if } a = A \\
\int_0^1 \pi(\theta)\, d\theta = 1 & \text{if } a = B.
\end{cases}$$

Since $2 E_{\theta}[\theta] = 2 \times 0.05/(1 + 0.05) \approx 0.095$ is less than 1, the Bayes action is to choose
company A. Bayes and minimax actions give different results in this example,
reflecting different attitudes towards handling the fact that $\theta$ is not known. Interestingly,
if instead of checking consumer reports, your colleague chose $\pi(\theta) = 1$,
a uniform prior that may represent lack of information about $\theta$, the Bayes solution
would be indifference between company A and company B.
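
The arithmetic of the telephone example is easy to reproduce. The Python sketch below is an illustration added here, not the authors' code; the function names are hypothetical, and it relies on the standard formulas for the mean and variance of a Beta($\alpha$, $\beta$) distribution.

```python
import math

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    s = alpha + beta
    return math.sqrt(alpha * beta / (s * s * (s + 1.0)))

def expected_losses(prior_mean):
    """Prior expected loss under L(theta, A) = 2 * theta and L(theta, B) = 1."""
    return {"A": 2.0 * prior_mean, "B": 1.0}

# Worst case over theta in [0, 1]: L(theta, A) is increasing, so its supremum is at theta = 1.
worst_case = {"A": 2.0 * 1.0, "B": 1.0}
minimax = min(worst_case, key=worst_case.get)

informative = expected_losses(beta_mean(0.05, 1.0))  # Beta(0.05, 1) prior from the text
uniform = expected_losses(beta_mean(1.0, 1.0))       # Beta(1, 1) is the uniform prior
bayes = min(informative, key=informative.get)

print("prior mean and sd:", round(beta_mean(0.05, 1.0), 4), round(beta_sd(0.05, 1.0), 4))
print("minimax action:", minimax)
print("Bayes action:", bayes, "with expected loss", round(informative["A"], 3))
print("expected losses under the uniform prior:", uniform)
```

Under the uniform prior both expected losses equal 1, which is the indifference noted in the text.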

The second example is from DeGroot (1970) and provides a useful geometric
interpretation for Bayes and minimax actions when $\mathcal{A}$ and $\Theta$ are both finite. Take
$\Theta = \{\theta_1, \theta_2\}$ and $\mathcal{A} = \{a_1, \ldots, a_6\}$ with the loss function specified in Table 7.2. For
any action $a$, the possible losses can be represented by the two-dimensional vector

$$y_a = [L(\theta_1, a), L(\theta_2, a)]'.$$

Figure 7.1 visualizes the vectors $y_1, \ldots, y_6$ corresponding to the losses of each of
the six actions. Say the prior is $\pi(\theta_1) = 1/3$. In the space of Figure 7.1, actions that
have the same expected loss with respect to this prior all lie on a line with equation

$$\tfrac{1}{3} L(\theta_1, a) + \tfrac{2}{3} L(\theta_2, a) = k,$$

where $k$ is the expected loss. Three of these are shown as dashed lines in Figure 7.1.
Bayes actions minimize the expected loss; that is, minimize the value of $k$. Geometrically,
to find a Bayes action we look for the minimum value of $k$ such that
the corresponding line intersects an available point. In our example this happens at
action $a_3$, for $k = 8/3$.

Table 7.2 Loss function.

              $a_1$   $a_2$   $a_3$   $a_4$   $a_5$   $a_6$
$\theta_1$     10       8       4       2       0       0
$\theta_2$      0       1       2       5       6      10

[Figure 7.1 Losses and Bayes action. The Bayes action is $a_3$. Axis label recovered from the figure: $L(\theta_2, a)$.]
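
The geometric search for the smallest $k$ amounts to computing the expected loss of each action in Table 7.2 and taking the minimum. A minimal Python sketch follows, using exact fractions to avoid rounding; the function name `bayes_actions` is a hypothetical choice made for this illustration.

```python
from fractions import Fraction

# Table 7.2: losses[a] = (L(theta_1, a), L(theta_2, a)).
losses = {
    "a1": (10, 0), "a2": (8, 1), "a3": (4, 2),
    "a4": (2, 5), "a5": (0, 6), "a6": (0, 10),
}

def bayes_actions(losses, p1):
    """Return the minimum expected loss k and the actions attaining it when pi(theta_1) = p1."""
    expected = {a: p1 * l1 + (1 - p1) * l2 for a, (l1, l2) in losses.items()}
    k = min(expected.values())
    return k, [a for a, v in expected.items() if v == k]

k, best = bayes_actions(losses, Fraction(1, 3))
print(k, best)
```

With $\pi(\theta_1) = 1/3$ this returns $k = 8/3$, attained only by $a_3$, in agreement with the figure.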

Consider now selecting a minimax action, and examine again the space of losses,
now reproduced in Figure 7.2. Actions that have the same maximum loss $k$ all lie on
the indifference set made of the vertical line for which $L(\theta_1, a) = k$ and $L(\theta_2, a) \le k$
and the horizontal line for which $L(\theta_1, a) \le k$ and $L(\theta_2, a) = k$. To find the minimax
solution we look for the minimum value of $k$ such that the corresponding set intersects
an available point. In our example, the solution is again $a_3$, with a maximum
loss of 4. The corresponding set is represented by dotted lines in the figure.

Figures 7.1 and 7.2 also show, in bold, some of the lines connecting the points
representing actions. Points on these lines do not correspond to any available option.
However, if one were to choose between two actions, say $a_3$ and $a_5$, at random, then
the expected losses would lie on that line; here the expectation is with respect to the
randomization. Rules in which one is allowed to randomly pick actions are called
randomized rules; we will discuss them in more detail in Section 7.2. Sometimes
using randomized rules it is possible to achieve a lower maximum loss than with any
of the available ones, so they are interesting for a minimax agent. For example, in
Figure 7.2, to find a minimax randomized action we move the wedge of the indifference
set until it contacts the segment joining points $a_3$ and $a_5$. This corresponds
to a randomized decision selecting action $a_3$ with probability 3/4 and action $a_5$ with
probability 1/4. By contrast, suppose the prior is now $\pi(\theta_1) = 1/2$. Then the dashed
lines of Figure 7.1 would make contact with the entire segment between points $y_3$
and $y_5$ at once. Thus, actions $a_3$ and $a_5$, and any randomized decision between $a_3$
and $a_5$, would be Bayes actions. However, no gains could be obtained from choosing
a randomized action. This is an instance of a much deeper result discussed in
Section 7.2.

[Figure 7.2 Losses and minimax action. The minimax action is $a_3$, represented by
point $y_3$. The minimax randomized action is to select action $a_3$ with probability 3/4
and action $a_5$ with probability 1/4. Axis label recovered from the figure: $L(\theta_2, a)$.]
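
The minimax and randomized minimax actions for Table 7.2 can also be found by brute force. The Python sketch below is an illustration under two assumptions that are not in the text: it only considers mixtures of two actions (enough here, since the lower boundary of the set of attainable loss vectors is made of segments joining pairs of points), and it searches mixing probabilities on a grid of step 1/100, an arbitrary choice that happens to contain 3/4.

```python
from fractions import Fraction as F
from itertools import combinations

# Table 7.2: losses[a] = (L(theta_1, a), L(theta_2, a)).
losses = {
    "a1": (10, 0), "a2": (8, 1), "a3": (4, 2),
    "a4": (2, 5), "a5": (0, 6), "a6": (0, 10),
}

# Nonrandomized minimax: smallest worst-case loss among the six actions.
minimax = min(losses, key=lambda a: max(losses[a]))

def mix(p, a, b):
    """Expected loss vector when action a is picked with probability p and b otherwise."""
    return tuple(p * u + (1 - p) * v for u, v in zip(losses[a], losses[b]))

# Randomized minimax over all pairs of actions and all grid probabilities.
grid = [F(i, 100) for i in range(101)]
candidates = [
    (max(mix(p, a, b)), a, b, p)
    for a, b in combinations(losses, 2)
    for p in grid
]
value, first, second, prob = min(candidates)

# With prior pi(theta_1) = 1/2, every mixture of a3 and a5 has the same expected loss.
flat = {(u + v) / 2 for u, v in (mix(p, "a3", "a5") for p in grid)}

print(minimax, max(losses[minimax]))
print(value, first, second, prob)
print(flat)
```

The search recovers the mixture of $a_3$ and $a_5$ with probability 3/4 on $a_3$, and a maximum expected loss of 3 instead of 4. The last line illustrates why randomization brings no gain to a Bayes agent in this example: along the segment the prior expected loss is constant.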


7.2 Data-based decisions 

7.2.1 Risk 

From a statistical viewpoint, the interesting questions arise when the outcome of an
experiment whose distribution depends on the parameter $\theta$ is available. To establish
notation, let $x$ denote the experimental outcome with possible values in the set $\mathcal{X}$, and
let $f(x|\theta)$ be the probability density function (or probability mass function, for discrete
random variables). This is also called the likelihood function when seen as a function of $\theta$.
The question is: how should one use the data to make an optimal decision? To explore
this question, we define a decision function (or decision rule) to be any function $\delta(x)$
with domain $\mathcal{X}$ and range in $\mathcal{A}$. A decision function is a recipe for turning data into
actions. We denote the class of all decision rules by $\mathcal{D}$. The minimax and Bayes
principles provide alternative approaches to evaluating decision rules.

A comment about notation: we use $x$ and $\theta$ to denote both the random variables
and their realized values. To keep things straight when computing expectations, we
use $E_x[g(x, \theta)]$ to denote the expectation of the function $g$ with respect to the marginal
distribution of $x$, $E_{x|\theta}[g(x, \theta)]$ for the expectation of $g$ with respect to $f(x|\theta)$, and
$E_{\theta}[g(x, \theta)]$ for the expectation of $g$ with respect to the prior on $\theta$.

Wald’s original theory is based on the expected performance of a decision rule
prior to the observation of the experiment, measured by the so-called risk function.

Definition 7.3 (Risk function) The risk function of a decision rule $\delta$ is

$$R(\theta, \delta) = \int_{\mathcal{X}} L(\theta, \delta)\, f(x|\theta)\, dx. \qquad (7.7)$$

The risk function was introduced by Wald to unify existing approaches for the evaluation
of statistical procedures from a frequentist standpoint. It focuses on the long-term
performance of a decision rule in a series of repetitions of the decision problem.
Some of the industrial applications that were motivating Wald have this flavor: the
Statistician crafts rules that are applied routinely as part of the production process,
and are evaluated based on average performance.
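
For a discrete outcome the integral in (7.7) becomes a sum over $x$, which makes the risk function easy to tabulate. The Python sketch below uses the regret loss of Table 7.1; the binary sampling model `f_x1` and the rule `delta` are hypothetical values chosen only for illustration and do not come from the text.

```python
from fractions import Fraction as F

# Regret loss from Table 7.1: loss[action][j] is the loss under state theta_{j+1}.
loss = {"a1": [0, 0, 1], "a2": [2, 4, 0]}

# HYPOTHETICAL sampling model (not from the text): f(x = 1 | theta_j) for a binary outcome x.
f_x1 = [F(3, 4), F(1, 2), F(1, 10)]

def risk(rule):
    """Risk R(theta, delta) for each state: the loss of delta(x) averaged over x given theta."""
    return [
        loss[rule[1]][j] * p1 + loss[rule[0]][j] * (1 - p1)
        for j, p1 in enumerate(f_x1)
    ]

# HYPOTHETICAL rule: choose a1 when x = 1 and a2 when x = 0.
delta = {1: "a1", 0: "a2"}
print(risk(delta))
```

The output is one risk value per state of the world. No single number summarizes the rule until the states are either weighted by a prior or replaced by their worst case.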


7.2.2 Optimality principles 

To define the minimax and expected utility principles in terms of decision rules
consider the parallel between the risk $R(\theta, \delta)$ and the loss $L(\theta, a)$. Just as $L$ can be
used to choose among actions, $R$ can be used to choose among decision functions.
Definitions 7.1 and 7.2 can be restated in terms of risk. For minimax:

Definition 7.4 (Minimax decision rule) A decision rule $\delta^M$ is minimax if

$$\sup_{\theta} R(\theta, \delta^M) = \inf_{\delta} \sup_{\theta} R(\theta, \delta). \qquad (7.8)$$

To define the Bayes rule we first establish a notation for the Bayes risk:

Definition 7.5 (Bayes risk) The Bayes risk associated with prior distribution $\pi$
and decision strategy $\delta$ is

$$r(\pi, \delta) = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta. \qquad (7.9)$$

The Bayes rules minimize the Bayes risk, that is:

Definition 7.6 (Bayes decision rule) A decision rule $\delta^*$ is Bayes with respect to $\pi$ if

$$r(\pi, \delta^*) = \inf_{\delta} r(\pi, \delta). \qquad (7.10)$$

What is the relationship between the Bayes rule and the expected utility
principle? There is a simple and intuitive way to determine a Bayes strategy, which
will also clarify this question. For every $x$, use the Bayes rule to determine the
posterior distribution of the states of the world

$$\pi(\theta|x) = \frac{\pi(\theta)\, f(x|\theta)}{m(x)} \qquad (7.11)$$

where

$$m(x) = \int_{\Theta} \pi(\theta)\, f(x|\theta)\, d\theta. \qquad (7.12)$$

This will summarize what is known about the state of the world given the experimental
outcome. Then, just as in Definition 7.2, find the Bayes action by computing
the expected loss. The difference now is that the relevant distribution for computing
the expectation will not be the prior $\pi(\theta)$ but the posterior $\pi(\theta|x)$ and the function to
be minimized will be the posterior expected loss

$$\mathcal{L}_{\pi_x}(a) = \int_{\Theta} L(\theta, a)\, \pi(\theta|x)\, d\theta. \qquad (7.13)$$

This is a legitimate step as long as we accept that static conditional probabilities also
constrain one’s opinion dynamically, once the outcome is observed (see also Chapters
2 and 3). The action that minimizes the posterior expected loss will depend on
$x$. So, in the end, this procedure implicitly defines a decision rule, called the formal
Bayes rule.
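
The recipe in (7.11)-(7.13) can be sketched for a finite set of states and a binary outcome. In the Python code below the loss is the regret loss of Table 7.1, while the prior and the sampling model are hypothetical numbers chosen so that the selected action changes with $x$; none of them come from the text.

```python
from fractions import Fraction as F

loss = {"a1": [0, 0, 1], "a2": [2, 4, 0]}   # regret loss, Table 7.1
prior = [F(1, 10), F(1, 10), F(4, 5)]       # HYPOTHETICAL prior on theta_1..theta_3
f_x1 = [F(3, 4), F(1, 2), F(1, 10)]         # HYPOTHETICAL f(x = 1 | theta)

def likelihood(x):
    """f(x | theta) for each state, for a binary outcome x."""
    return [p if x == 1 else 1 - p for p in f_x1]

def posterior(x):
    """Posterior pi(theta | x), equation (7.11)."""
    joint = [pi * lik for pi, lik in zip(prior, likelihood(x))]
    m = sum(joint)                          # marginal m(x), equation (7.12)
    return [j / m for j in joint]

def posterior_expected_loss(a, x):
    """Posterior expected loss of action a given x, equation (7.13)."""
    return sum(l * p for l, p in zip(loss[a], posterior(x)))

def formal_bayes_rule():
    """For each x, pick the action with the smallest posterior expected loss."""
    return {x: min(loss, key=lambda a: posterior_expected_loss(a, x)) for x in (0, 1)}

print(posterior(1))
print(formal_bayes_rule())
```

With these numbers the rule selects $a_1$ after observing $x = 1$ and $a_2$ after observing $x = 0$. Only the likelihood of the outcome actually observed enters each of the two calculations.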

Does the formal Bayes rule satisfy Definition 7.6? Consider the relationship
between the posterior expected loss and the Bayes risk:

$$r(\pi, \delta) = \int_{\Theta} \int_{\mathcal{X}} L(\theta, \delta)\, f(x|\theta)\, \pi(\theta)\, dx\, d\theta \qquad (7.14)$$

$$= \int_{\mathcal{X}} \left[ \int_{\Theta} L(\theta, \delta)\, \pi(\theta|x)\, d\theta \right] m(x)\, dx, \qquad (7.15)$$

assuming we can reverse the order of the integrals. The quantity in square brackets
is the posterior expected loss. Our intuitive recipe for finding a Bayes strategy was
to minimize the posterior expected loss for every $x$. But in doing this we inevitably
minimize $r$ as well, and this satisfies Definition 7.6. Conversely, if we wish to minimize
$r$ with respect to the function $\delta$, we must do so pointwise in $x$, by minimizing
the integral in square brackets.
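
In a finite problem the equivalence between (7.14) and (7.15) can be verified by enumeration: compute the Bayes risk of every decision rule, and compare the smallest value with the one obtained by minimizing pointwise in $x$. The sketch below reuses the regret loss of Table 7.1 with a hypothetical prior and sampling model, as labeled in the comments.

```python
from fractions import Fraction as F
from itertools import product

loss = {"a1": [0, 0, 1], "a2": [2, 4, 0]}   # regret loss, Table 7.1
prior = [F(1, 10), F(1, 10), F(4, 5)]       # HYPOTHETICAL prior on theta_1..theta_3
f_x1 = [F(3, 4), F(1, 2), F(1, 10)]         # HYPOTHETICAL f(x = 1 | theta)

def lik(x, j):
    """f(x | theta_{j+1}) for a binary outcome x."""
    return f_x1[j] if x == 1 else 1 - f_x1[j]

def bayes_risk(rule):
    """r(pi, delta) as in (7.14): average the loss over x given theta, then over theta."""
    return sum(prior[j] * lik(x, j) * loss[rule[x]][j] for j in range(3) for x in (0, 1))

def pointwise_minimum():
    """Minimum of (7.15): for each x, keep the smallest value of m(x) times the posterior expected loss."""
    total = 0
    for x in (0, 1):
        # m(x) * posterior expected loss of a equals sum_theta pi(theta) f(x | theta) L(theta, a).
        total += min(sum(prior[j] * lik(x, j) * loss[a][j] for j in range(3)) for a in loss)
    return total

# All four decision rules, keyed by (action when x = 0, action when x = 1).
rules = [dict(zip((0, 1), acts)) for acts in product(loss, repeat=2)]
risks = {(r[0], r[1]): bayes_risk(r) for r in rules}

print(risks)
print(min(risks, key=risks.get), pointwise_minimum())
```

The rule with the smallest Bayes risk coincides with the one obtained by minimizing the posterior expected loss separately for each outcome, and the two minimum values agree.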

The conditions for interchangeability of integrals are not to be taken for granted. 
See Problem 8.7 for an example. Another important point is that the formal Bayes 
rule may not be unique, and there are examples of nonunique formal Bayes rules 
whose risk functions differ. More about this in Chapter 8—see for example worked 
Problem 8.1.

7.2.3 Rationality principles and the Likelihood Principle

The use of formal Bayes rules is backed directly by the axiomatic theory, and has
profound implications for statistical inference. First, for a given experimental outcome
$x$, a Bayes rule can be determined without any averaging over the set of all
possible alternative experimental outcomes. Only the probabilities of the observed
outcome under the various possible states of the world are relevant. Second, all features
of the experiment that are not captured in $f(x|\theta)$ do not enter the calculation
of the posterior expected loss, and thus are also irrelevant for the statistical decision.
The result is actually stronger: $f(x|\theta)$ can be multiplied by an arbitrary nonzero
function of $x$ without altering the Bayes rule. Thus, for example, the data can be
reduced by sufficiency, as can be seen by plugging the factorization theorem (details
in Section 8.4.2) into the integrals above. This applies to both the execution of a
decision once data are observed and the calculation of the criterion that is used for
choosing between rules.

That all of the information in $x$ about $\theta$ is contained in the likelihood function
is a corollary of the expected utility paradigm, but is so compelling that it is taken by
some as an independent principle of statistical inference, under the name of the
Likelihood Principle. This is a controversial principle because the majority of frequentist
measures of evidence, including highly prevalent ones such as coverage probabilities
and p-values, violate it. A monograph by Berger and Wolpert (1988) elaborates on
this theme, and includes extensive and insightful discussions by several other authors
as well. We will return to the Likelihood Principle in relation to optional stopping in
Section 15.6.

Example 7.1 While decision rules derived from the expected utility principle
satisfy the Likelihood Principle automatically, the same is not true of minimax rules.
To illustrate this point in a simple example, return to the loss function in Table 7.1
and consider the regret form. In the absence of data, the minimax action is $a_1$. Now,
suppose you can observe a binary variable $x$, and you can do so under two alternative
experimental designs with sampling distributions $f^1$ and $f^2$, shown in Table 7.3.
Because there are two possible outcomes and two possible actions, there are four
possible decision functions:

$\delta_1(x) = a_1$, that is you choose $a_1$ regardless of the experimental outcome.
$\delta_2(x) = a_2$, that is you choose $a_2$ regardless of the experimental outcome.
$\delta_3(x) = a_1$ if $x = 1$, and $a_2$ if $x = 0$.
$\delta_4(x) = a_1$ if $x = 0$, and $a_2$ if $x = 1$.


In the sampling models of Table 7.3 we have $f^2(x = 1|\theta) = 3 f^1(x = 1|\theta)$ for all
$\theta \in \Theta$; that is, if $x = 1$ is observed, the two likelihood functions are
proportional to each other. This implies that when $x = 1$ the expected utility rule will
be the same under either sampling model. Does the same apply to minimax? Let us consider
the risk functions for the four decision rules shown in Table 7.4. Under $f^1$ the minimax
decision rule is $\delta^M(x) = \delta_4(x)$. However, under $f^2$ the minimax decision rule is
$\delta^M(x) = \delta_1(x) = a_1$. Thus, if we observe $x = 1$ under $f^1$, the minimax decision
is $a_2$, while under $f^2$ it is $a_1$. This is a violation of the Likelihood Principle.

Table 7.3 Probability functions for two alternative sampling models.

                      $\theta_1$   $\theta_2$   $\theta_3$
$f^1(x = 1|\theta)$      0.20         0.10         0.25
$f^2(x = 1|\theta)$      0.60         0.30         0.75

For a concrete example, imagine measuring an ordinal outcome $y$ with categories
0, 1/3, 2/3, 1, with likelihood as in Table 7.5. Rather than asking about $y$ directly we
can use two possible questionnaires giving dichotomous answers $x$. One, corresponding
to $f^1$, dichotomizes $y$ into 1 versus all else, while the other, corresponding to $f^2$,
dichotomizes $y$ into 0 versus all else. Because categories 1/3, 2/3, 1 have the same
likelihood, $f^2$ is the better instrument overall. However, if the answer is $x = 1$, then
it does not matter which instrument is used, because in both cases we know that the
underlying latent variable must be either 1 or a value which is equivalent to it as far
as learning about $\theta$ is concerned. The fact that in a different experiment the outcome
could have been ambiguous about $y$ in one dichotomization and not in the other is not
relevant according to the Likelihood Principle. However, the risk function $R$, which
depends on the whole sampling distribution, and is concerned about long-run average
performance of the rule over repeated experiments, is affected. A couple of famous
examples of standard inferential approaches that violate the Likelihood Principle in
somewhat embarrassing ways are in Problems 7.12 and 8.5.

Table 7.4 Risk functions for the four decision rules, under the two
alternative sampling models of Table 7.3.

                   Under $f^1(\cdot|\theta)$                Under $f^2(\cdot|\theta)$
                $\theta_1$  $\theta_2$  $\theta_3$      $\theta_1$  $\theta_2$  $\theta_3$
$\delta_1(x)$      0.00        0.00        1.00            0.00        0.00        1.00
$\delta_2(x)$      2.00        4.00        0.00            2.00        4.00        0.00
$\delta_3(x)$      1.60        3.60        0.25            0.80        2.80        0.75
$\delta_4(x)$      0.40        0.40        0.75            1.20        1.20        0.25

Table 7.5 Probability functions for the unobserved outcome $y$ underlying
the two sampling models of Table 7.3.

                      $\theta_1$   $\theta_2$   $\theta_3$
$f(y = 1|\theta)$        0.20         0.10         0.25
$f(y = 2/3|\theta)$      0.20         0.10         0.25
$f(y = 1/3|\theta)$      0.20         0.10         0.25
$f(y = 0|\theta)$        0.40         0.70         0.25
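
The risk calculations of Example 7.1 can be checked numerically. The following is a minimal sketch, not part of the original text. It assumes that the regret losses can be read off the two constant rules in Table 7.4 (the rows for the rules that always choose $a_1$ and always choose $a_2$); the uniform prior used for the posterior comparison and all function names are hypothetical choices made for illustration.

```python
# Regret losses L(theta, a) for theta_1, theta_2, theta_3, read off the
# constant-rule rows of Table 7.4 (an assumption of this sketch).
LOSS = {"a1": [0.0, 0.0, 1.0], "a2": [2.0, 4.0, 0.0]}

# P(x = 1 | theta) under the two designs of Table 7.3.
DESIGNS = {"f1": [0.20, 0.10, 0.25], "f2": [0.60, 0.30, 0.75]}

# Each decision rule maps the outcome x in {0, 1} to an action.
RULES = {
    "delta1": {0: "a1", 1: "a1"},
    "delta2": {0: "a2", 1: "a2"},
    "delta3": {0: "a2", 1: "a1"},
    "delta4": {0: "a1", 1: "a2"},
}


def risk(rule, p_x1):
    """R(theta, delta) = sum over x of L(theta, delta(x)) f(x | theta), per state."""
    return [
        (1 - p) * LOSS[rule[0]][j] + p * LOSS[rule[1]][j]
        for j, p in enumerate(p_x1)
    ]


def minimax_rule(p_x1):
    """Name of the rule whose largest risk over the states is smallest."""
    worst = {name: max(risk(rule, p_x1)) for name, rule in RULES.items()}
    return min(worst, key=worst.get)


def posterior_given_x1(prior, p_x1):
    """Posterior over the three states after observing x = 1."""
    joint = [pr * p for pr, p in zip(prior, p_x1)]
    total = sum(joint)
    return [j / total for j in joint]


if __name__ == "__main__":
    for name, p_x1 in DESIGNS.items():
        print(name, "minimax rule:", minimax_rule(p_x1))
    prior = [1 / 3, 1 / 3, 1 / 3]  # hypothetical prior, for illustration only
    for name, p_x1 in DESIGNS.items():
        print(name, "posterior given x = 1:", posterior_given_x1(prior, p_x1))
```

Under these assumptions the posterior given $x = 1$ is the same for both designs, because the factor of 3 cancels in the normalization, while the minimax rule differs between the designs.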


★ 


7.2.4 Nuisance parameters 

The realistic specification of a sampling model often requires parameters other than 
those of primary interest. These additional parameters are called “nuisance parameters.”
This is one of the very few reasonably named concepts in statistics, as it causes
all kinds of trouble to frequentist and likelihood theories alike. Basu (1975) gives a 
critical discussion. 

From a decision-theoretic standpoint, we can think of nuisance parameters as
those which appear in the sampling distribution, but not in the loss function. We
formalize this notion and establish a general result for dealing with nuisance
parameters in statistics. The bottom line is that the expected utility principle
justifies averaging the likelihood and the prior over the possible values of the
nuisance parameter and taking things from there. In probabilistic
terminology, nuisance parameters can be integrated out, and the original decision
problem can be replaced by its marginal version. More specifically:

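Theorem 7.1 If $\theta$ can be partitioned into $(\theta^*, \eta)$ such that the loss
$L(\theta, a)$ depends on $\theta$ only through $\theta^*$, then $\eta$ is a nuisance parameter
and the Bayes rule for the problem with likelihood $f(x|\theta)$ and prior $\pi(\theta)$ is the
same as the Bayes rule for the problem with likelihood

\[ f^*(x|\theta^*) = \int_{H} f(x|\theta^*, \eta)\,\pi(\eta|\theta^*)\,d\eta \]

and prior

\[ \pi(\theta^*) = \int_{H} \pi(\theta^*, \eta)\,d\eta, \]

where $H$ is the domain of $\eta$.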

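Proof: We assume that all integrals involved are finite. Take any decision rule $\delta$. The
Bayes risk is

\[
\begin{aligned}
r(\pi, \delta) &= \int_{\Theta} \int_{\mathcal{X}} L(\theta, \delta)\, f(x|\theta)\, \pi(\theta)\, dx\, d\theta \\
&= \int_{\Theta^*} \int_{\mathcal{X}} L(\theta^*, \delta) \left[ \int_{H} f(x|\theta^*, \eta)\, \pi(\eta|\theta^*)\, d\eta \right] \pi(\theta^*)\, dx\, d\theta^* \\
&= \int_{\Theta^*} \int_{\mathcal{X}} L(\theta^*, \delta)\, f^*(x|\theta^*)\, \pi(\theta^*)\, dx\, d\theta^*,
\end{aligned}
\]

that is the Bayes risk for the problem with likelihood $f^*(x|\theta^*)$ and prior $\pi(\theta^*)$.
Here $\Theta^*$ denotes the domain of $\theta^*$. We used the independence of $L$ on $\eta$, and the
relation $\pi(\theta) = \pi(\eta, \theta^*) = \pi(\eta|\theta^*)\pi(\theta^*)$.

□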
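
As an informal check of Theorem 7.1, the sketch below compares the Bayes rule of a small discrete problem with that of its marginal version. It is an illustration only: the binary parameter of interest, the binary nuisance parameter, the joint prior, the likelihood, and the losses are all hypothetical numbers chosen for this example, and sums replace the integrals of the theorem.

```python
from itertools import product

# Hypothetical joint prior pi(theta_star, eta) on {0, 1} x {0, 1}.
PRIOR = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
# Hypothetical likelihood f(x = 1 | theta_star, eta).
P_X1 = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}
# Hypothetical loss L(theta_star, a); it does not depend on eta.
LOSS = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 0.0}


def lik(x, theta_star, eta):
    """f(x | theta_star, eta) for binary x."""
    p = P_X1[(theta_star, eta)]
    return p if x == 1 else 1 - p


def bayes_rule_full():
    """Minimize, for each x, the loss averaged over (theta_star, eta)."""
    rule = {}
    for x in (0, 1):
        score = {
            a: sum(
                LOSS[(ts, a)] * lik(x, ts, eta) * PRIOR[(ts, eta)]
                for ts, eta in product((0, 1), repeat=2)
            )
            for a in (0, 1)
        }
        rule[x] = min(score, key=score.get)
    return rule


def bayes_rule_marginal():
    """Same minimization using the marginal likelihood and marginal prior."""
    prior_star = {ts: PRIOR[(ts, 0)] + PRIOR[(ts, 1)] for ts in (0, 1)}

    def lik_star(x, ts):
        # f*(x | theta_star) = sum over eta of f(x | theta_star, eta) pi(eta | theta_star)
        return sum(
            lik(x, ts, eta) * PRIOR[(ts, eta)] / prior_star[ts] for eta in (0, 1)
        )

    rule = {}
    for x in (0, 1):
        score = {
            a: sum(LOSS[(ts, a)] * lik_star(x, ts) * prior_star[ts] for ts in (0, 1))
            for a in (0, 1)
        }
        rule[x] = min(score, key=score.get)
    return rule


if __name__ == "__main__":
    print("full problem:    ", bayes_rule_full())
    print("marginal problem:", bayes_rule_marginal())
```

With these numbers both computations select the same action for each outcome, as the theorem predicts. The agreement is algebraic, so any other choice of hypothetical inputs should behave the same way, up to ties.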


This theorem puts to rest the issue of nuisance parameters in every conceivable
statistical problem, as long as one can specify reasonable priors, and compute
integrals. Neither is easy, of course. Priors on high-dimensional nuisance parameters can
be very difficult to assess based on expert knowledge and often include surprises in
the form of difficult-to-anticipate implications when nuisance parameters are
integrated out. Integration in high dimension has made much progress over the last 20
years, thanks mostly to Markov chain Monte Carlo (MCMC) methods (Robert &
Casella 1999), but is still hard, in part because we tend to adapt to this progress and
specify models that are at the limit of what is computable. Nonetheless, the elegance
and generality of the solution are compelling.

Another way to interpret the Bayesian solution to the nuisance parameter problem
is to look at posterior expected losses. An argument similar to that used in the
proof of Theorem 7.1 would show that one can equivalently compute the posterior
expected losses based on the marginal posterior distribution of $\theta^*$ given by

\[ \pi(\theta^*|x) = \int_{H} \pi(\theta^*, \eta|x)\,d\eta. \]

This highlights the fact that the Bayes rule is potentially affected by any of the
features of the posterior distribution $\pi(\eta|x)$ of the nuisance parameter, including all
aspects of the uncertainty that remains about it after observing the data. This is
in contrast to approaches that eliminate nuisance parameters by “plugging in” best
estimates either in the likelihood function or in the decision rule itself. The empirical
Bayes approach of Section 9.2.2 is an example.

7.3 The travel insurance example 

In this section we introduce a mildly realistic medical example that will hopefully 
afford the simplest possible illustration of the concepts introduced in this chapter 
and also give us the excuse to introduce some terminology and graphics from decision
analysis. We will return to this example when we consider multistage decision
problems in Chapters 12 and 13. 

Suppose that you are from the United States and are about to take a trip overseas. 
You are not sure about the status of your vaccination against a certain mild disease 
that is common in the country you plan to visit, and need to decide whether to buy 
medical insurance for the trip. We will assume that you will be exposed to the disease, 
but you are uncertain about whether your present immunization will work. Based on 
aggregate data on western tourists, the chance of developing the disease during the 
trip is about 3% overall. Treatment and hospital abroad would normally cost you, 
say, 1000 dollars. There is also a definite loss in quality of life in going all the way to 
an exotic country and being grounded at a local hospital instead of making the most
out of your experience, but we are going to ignore this aspect here. On the other
hand, if you buy a travel insurance plan, which you can do for 50 dollars, all your 
expenses will be covered. This is a classical gamble versus sure outcome situation. 
Table 7.6 summarizes the loss function for this problem. 

For later reference we are going to represent this simple case using a decision 
tree. In a decision tree, a square denotes a decision node or decision point. The
decision maker has to decide among actions, represented by branches stemming out from
the decision node. A circle represents a chance node or chance point. Each branch 
out of the circle represents, in this case, a state of nature, though circles could also 
be used for experimental results. On the right side of the decision tree we have the 
consequences. Figure 7.3 shows the decision tree for our problem. 

In a Bayesian mode, you use the expected losses to evaluate the two actions, as 
follows: 


No insurance: Expected loss = 1000 × 0.03 + 0 × 0.97 = 30

Insurance: Expected loss = 50 × 0.03 + 50 × 0.97 = 50.


Table 7.6 Monetary losses associated with buying
and with not buying travel insurance for the trip.

                           Events
Actions           $\theta_1$: ill    $\theta_2$: not ill
Insurance               50                  50
No insurance          1000                   0




Figure 7.3 Decision tree for the travel insurance example. This is a single-stage
tree, because it includes only one decision node along any given path.

Figure 7.4 Solved decision tree for the medical insurance example. At the top of
each chance node we have the expected loss, while at the top of the decision node 
we have the minimum expected loss. Alongside the branches stemming out from the 
chance node we have the probabilities of the states of nature. The action that is not 
optimal is crossed out by a double line. 

The Bayes decision is the decision that minimizes the expected loss, in this case not
to buy the insurance. However, if the chance of developing the disease were greater
than 5%, the best decision would be to buy the insurance; at exactly 5% the two actions
have the same expected loss of 50 dollars. The solution to this decision
problem is represented in Figure 7.4.
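
The no-data analysis can be reproduced in a few lines. This is a small sketch using the losses of Table 7.6 and the 3% chance of illness quoted above; the function names are hypothetical and serve only this illustration.

```python
# Monetary losses from Table 7.6, indexed by action and then by state.
LOSS = {
    "insurance": {"ill": 50.0, "not ill": 50.0},
    "no insurance": {"ill": 1000.0, "not ill": 0.0},
}


def expected_loss(action, p_ill):
    """Expected loss of an action when the probability of illness is p_ill."""
    return LOSS[action]["ill"] * p_ill + LOSS[action]["not ill"] * (1 - p_ill)


def bayes_action(p_ill):
    """Action with the smallest expected loss."""
    return min(LOSS, key=lambda action: expected_loss(action, p_ill))


def break_even_probability():
    """Value of p_ill at which 1000 * p_ill equals the 50-dollar premium."""
    return LOSS["insurance"]["ill"] / LOSS["no insurance"]["ill"]


if __name__ == "__main__":
    for action in LOSS:
        print(action, expected_loss(action, 0.03))
    print("Bayes action at 3%:", bayes_action(0.03))
    print("break-even probability:", break_even_probability())
```

The break-even computation relies on the premium being the same in both states, which holds for Table 7.6 but would need to be generalized for other loss tables.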

You can improve your decision making by gathering data on how likely you are
to get the disease. Imagine you have the option of undergoing a medical test that
informs you about whether your immunization is likely to work. The test has only
two possible verdicts. One indicates that you are prone to the disease and the other
indicates that you are not. Sticking with the time-honored medical tradition of
calling “positive” the results of tests that suggest the presence of the most devastating
illnesses, we will call positive the outcome indicating that you are disease prone.
Unfortunately, the test is not perfectly accurate. Let us assume that, after some
clinical experiments, it was determined that the probability that the test is positive when
you really are going to get the disease is 0.9, while the probability that the test is
negative when you are not going to get the disease is 0.77. For a perfect test, these
numbers would both be 1. In medical terminology, the first probability represents the
sensitivity of the test, while the second one represents the specificity of the test. Call
$x$ the indicator variable for the event, “The test is positive.” In the notation of this
chapter, the probabilities available so far are

$\pi(\theta_1) = 0.03$
$f(x = 1|\theta_1) = 0.90$
$f(x = 0|\theta_2) = 0.77$.

After the test, your individual chances of illness will be different from the overall
3%. The test could provide valuable information and potentially alter your chosen
course of action. The question of this chapter is precisely how to use the results of
the test to make a better decision. The test seems reliable enough that we may want
to buy the insurance if the test is positive and not otherwise. Is this right?

To answer this question, we will consider decision rules. In our example there
are two possible experimental outcomes and two possible actions so there are a total
of four possible decision rules. These are

$\delta_0(x)$: Do not buy the insurance.

$\delta_1(x)$: Buy the insurance if $x = 1$. Otherwise, do not.

$\delta_2(x)$: Buy the insurance if $x = 0$. Otherwise, do not.

$\delta_3(x)$: Buy the insurance.

Decision rules $\delta_0$ and $\delta_3$ choose the same action irrespective of the outcome of the
test: they are constant functions. Decision rule $\delta_1$ does what comes naturally: buy the
insurance only if the test indicates that you are disease prone. Decision rule $\delta_2$ does
exactly the opposite. As you might expect, it will not turn out to be very competitive.

Let us now look at the losses associated with each decision rule ignoring, for
now, any costs associated with testing. Of course, the loss for rules $\delta_1$ and $\delta_2$ now
depends on the data. We can summarize the situation as shown in Table 7.7.

Table 7.7 Loss table for the decision rules in the travel insurance
example.

                     $\theta_1$: ill              $\theta_2$: not ill
                  $x = 0$    $x = 1$          $x = 0$    $x = 1$
$\delta_0(x)$      $1000      $1000             $0         $0
$\delta_1(x)$      $1000      $50               $0         $50
$\delta_2(x)$      $50        $1000             $50        $0
$\delta_3(x)$      $50        $50               $50        $50

Two unknowns will affect how good our choice will turn out to be: the test result
and whether you will be ill during the trip. As in equation (7.14) we can choose the
Bayes rule by averaging out both, beginning with averaging losses by state, and then
further averaging the results to obtain overall average losses. The results are shown
in Table 7.8. To illustrate how entries are calculated, consider $\delta_1$. The average risk if
$\theta = \theta_1$ is

\[ 1000 f(x = 0|\theta_1) + 50 f(x = 1|\theta_1) = 1000 \times 0.10 + 50 \times 0.90 = 145.0 \]

while the average risk if $\theta = \theta_2$ is

\[ 0 f(x = 0|\theta_2) + 50 f(x = 1|\theta_2) = 0 \times 0.77 + 50 \times 0.23 = 11.5, \]

so that the overall average is

\[ 145.0 \times \pi(\theta_1) + 11.5 \times \pi(\theta_2) = 145.0 \times 0.03 + 11.5 \times 0.97 \approx 15.5. \]

Table 7.8 Average losses by state and overall for the decision rules
in the travel insurance example.

                     Average losses by state          Average losses overall
                  $\theta_1$: ill    $\theta_2$: not ill
$\delta_0(x)$       $1000.0             $0.0                  $30.0
$\delta_1(x)$       $145.0              $11.5                 $15.5
$\delta_2(x)$       $905.0              $38.5                 $64.5
$\delta_3(x)$       $50.0               $50.0                 $50.0

Strategy $\delta_1(x)$ is the Bayes strategy as it minimizes the overall expected loss.
This calculation is effectively considering the losses in Table 7.7 and computing the
expectation of each row with respect to the joint distribution of $\theta$ and $x$. In this sense
it is consistent with preferences expressed prior to observing $x$. You are bound to
stick to the optimal rule after you actually observe $x$ only if you also agree with the
before/after axiom of Section 5.2.3. Then, an alternative derivation of the Bayes rule
could have been worked out directly by computing posterior expected losses given
$x = 1$ and $x = 0$, as we know from Section 7.2.2 and equation (7.15).
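
Both routes to the Bayes rule can be traced numerically. The sketch below is an illustration under the probabilities and losses stated in this section (prior 0.03, sensitivity 0.90, specificity 0.77, a 50-dollar premium, and a 1000-dollar treatment cost). The encoding of a rule as a mapping from the test result to a buy or do-not-buy flag, and all function names, are hypothetical choices.

```python
PRIOR_ILL = 0.03              # pi(theta_1)
SENS = 0.90                   # f(x = 1 | ill)
SPEC = 0.77                   # f(x = 0 | not ill)
PREMIUM, TREATMENT = 50.0, 1000.0

# A rule maps the test result x to True (buy insurance) or False (do not).
RULES = {
    "delta0": {0: False, 1: False},
    "delta1": {0: False, 1: True},
    "delta2": {0: True, 1: False},
    "delta3": {0: True, 1: True},
}


def loss(ill, buy):
    """Monetary loss as in Table 7.6."""
    if buy:
        return PREMIUM
    return TREATMENT if ill else 0.0


def p_x(x, ill):
    """Sampling probability f(x | theta) of the test result."""
    p_pos = SENS if ill else 1 - SPEC
    return p_pos if x == 1 else 1 - p_pos


def risk(rule, ill):
    """Average loss by state, as in the middle columns of Table 7.8."""
    return sum(loss(ill, rule[x]) * p_x(x, ill) for x in (0, 1))


def bayes_risk(rule):
    """Overall average loss, as in the last column of Table 7.8."""
    return PRIOR_ILL * risk(rule, True) + (1 - PRIOR_ILL) * risk(rule, False)


def posterior_ill(x):
    """Posterior probability of illness given the test result x."""
    num = PRIOR_ILL * p_x(x, True)
    return num / (num + (1 - PRIOR_ILL) * p_x(x, False))


def posterior_bayes_rule():
    """For each x, buy exactly when not buying has the larger posterior expected loss."""
    return {x: TREATMENT * posterior_ill(x) > PREMIUM for x in (0, 1)}


if __name__ == "__main__":
    for name, rule in RULES.items():
        print(name, round(risk(rule, True), 1), round(risk(rule, False), 1),
              round(bayes_risk(rule), 1))
    print("posterior route:", posterior_bayes_rule())
```

Minimizing the overall average loss across the four rules and minimizing the posterior expected loss separately for each test result should select the same rule, which is the equivalence described in Section 7.2.2.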

So far, in solving the decision problem we utilized the Bayes principle.
Alternatively, if you follow the minimax principle, your goal is avoiding the largest possible
loss. Let us start with the case in which no data are available. In our example, the
largest loss is 50 dollars if you buy the insurance and 1000 if you do not. By this
principle you should buy the medical insurance. In fact, as we examine Table 7.6, we note
that the greatest loss is associated with event $\theta_1$ no matter what the action is.
Therefore the maximization step in the minimax calculation will resolve the uncertainty
about $\theta$ by assuming, pessimistically, that you will become ill, no matter how much
evidence you may accumulate to the contrary. To alleviate this drastic pessimism, let
us express the losses in “regret” form. The argument is as follows. If you condition
on getting ill, the best you can do is a loss of $50, by buying the medical insurance.
The alternative action entails a loss of $1000. When you assess the worthiness of this
action, you should compare the loss to the best (smallest) loss that you could have
obtained. You do indeed lose $1000, but your “regret” is only for the $950 that you
could have avoided spending. Applying equation (7.2) to Table 7.6 gives Table 7.9.

When reformulating the decision problem in terms of regret, the expected 
losses are 

No insurance: Expected loss = 950 × 0.03 + 0 × 0.97 = 28.5
Insurance: Expected loss = 0 × 0.03 + 50 × 0.97 = 48.5.

Table 7.9 Regret loss table for the actions in the
travel insurance example.

                            Event
Decision            $\theta_1$      $\theta_2$
insurance              $0              $50
no insurance           $950            $0


Table 7.10 Risk table for the decision rules in the medical insurance
example when using regret losses.

                Risk $R(\theta, \delta)$ by state     Largest risk     Average risk $r(\pi, \delta)$
                $\theta_1$        $\theta_2$
$\delta_0(x)$     $950              $0                 $950               $28.5
$\delta_1(x)$     $95               $11.5              $95                $14.0
$\delta_2(x)$     $855              $38.5              $855               $63.0
$\delta_3(x)$     $0                $50                $50                $48.5



The Bayes action remains the same. The expected losses become smaller, but the 
expected loss of every action becomes smaller by the same amount. On the contrary, 
the minimax action may change, though in this example it does not. The minimax 
solution is still to buy the insurance. 

Does the optimal minimax decision change depending on the test results? In
Table 7.10 we derive the Bayes and minimax rules using the regret losses. Strategy
$\delta_1$ is the Bayes strategy as we had seen before. We also note that $\delta_2$ is dominated by
$\delta_1$, that is it has a higher risk than $\delta_1$ irrespective of the true state of the world. Using
the minimax approach, the optimal decision is $\delta_3$, that is it is still optimal to buy
the insurance irrespective of the test result. This conclusion depends on the losses,
sensitivity, and specificity, and different rules could be minimax if these parameters
were changed.
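
The entries of Table 7.10 can be recomputed with a short script. This is a sketch under the numbers of this section, with regret defined as the loss minus the smallest loss attainable in the same state; the encoding of rules and the function names are hypothetical.

```python
PRIOR_ILL = 0.03
SENS, SPEC = 0.90, 0.77

# Monetary losses of Table 7.6, indexed by (ill, buy).
LOSS = {(True, True): 50.0, (True, False): 1000.0,
        (False, True): 50.0, (False, False): 0.0}

# A rule maps the test result x to True (buy insurance) or False (do not).
RULES = {
    "delta0": {0: False, 1: False},
    "delta1": {0: False, 1: True},
    "delta2": {0: True, 1: False},
    "delta3": {0: True, 1: True},
}


def regret(ill, buy):
    """Loss minus the best loss attainable in the same state (Table 7.9)."""
    return LOSS[(ill, buy)] - min(LOSS[(ill, b)] for b in (True, False))


def p_x(x, ill):
    """Sampling probability f(x | theta) of the test result."""
    p_pos = SENS if ill else 1 - SPEC
    return p_pos if x == 1 else 1 - p_pos


def risk(rule, ill):
    """Regret risk R(theta, delta) in a given state."""
    return sum(regret(ill, rule[x]) * p_x(x, ill) for x in (0, 1))


def summary(rule):
    """Largest risk over the states and prior-averaged risk, as in Table 7.10."""
    r_ill, r_not = risk(rule, True), risk(rule, False)
    return {
        "largest": max(r_ill, r_not),
        "average": PRIOR_ILL * r_ill + (1 - PRIOR_ILL) * r_not,
    }


def best_rule(criterion):
    """Rule minimizing the chosen criterion: 'largest' (minimax) or 'average' (Bayes)."""
    return min(RULES, key=lambda name: summary(RULES[name])[criterion])


if __name__ == "__main__":
    for name, rule in RULES.items():
        print(name, summary(rule))
    print("minimax:", best_rule("largest"), " Bayes:", best_rule("average"))
```

Changing `SENS`, `SPEC`, or the losses in this sketch is a quick way to explore the remark that other rules could become minimax under different parameter values.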

This example will reappear in Chapters 12 and 13, when we will consider both 
the decision of whether to do the test and the decision of what to do with the 
information. 


7.4 Randomized decision rules 

In this section we briefly encounter randomized decision rules. From a frequentist 
viewpoint, the randomized decision rules are important because they guarantee, for 
example, specified error levels in the development of hypothesis testing procedures
and confidence intervals. From a Bayesian viewpoint, we will see that randomized
decision rules are not necessary, because they cannot improve the Bayes risk 
compared to nonrandomized decision rules. 

Definition 7.7 (Randomized decision rule) A randomized decision rule $\delta^R(x, \cdot)$
is, for each $x$, a probability distribution on $\mathcal{A}$. In particular, $\delta^R(x, A)$ denotes the
probability that an action in $A$ (a subset of $\mathcal{A}$) is chosen. The class of randomized
decision rules is denoted by $\mathcal{D}^R$.

Definition 7.8 (Loss function for a randomized decision rule) A randomized
decision rule $\delta^R(x, \cdot)$ has loss

\[ L(\theta, \delta^R(x)) = E^{\delta^R(x, \cdot)} L(\theta, a). \]    (7.16)



We note that a nonrandomized decision rule is a special case of a randomized 
decision rule which assigns, for any given x, a specific action with probability one. 

In the simple setting of Figure 7.1, no randomized decision rule in $\mathcal{D}^R$ can
improve the Bayes risk attained with a nonrandomized Bayes decision rule in $\mathcal{D}$.
This turns out to be the case in general:

Theorem 7.2 For every prior distribution $\pi$ on $\Theta$, the Bayes risk on the set of
randomized estimators is the same as the Bayes risk on the set of nonrandomized
estimators, that is

\[ \inf_{\delta \in \mathcal{D}} r(\pi, \delta) = \inf_{\delta^R \in \mathcal{D}^R} r(\pi, \delta^R). \]

For a proof see Robert (1994). This result continues to hold when the Bayes risk is
not finite, but does not hold if $r$ is replaced by $R(\theta, \delta)$ unless additional conditions
are imposed on the loss function (Berger 1985).
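
The intuition behind Theorem 7.2 can be illustrated with a finite set of actions: the expected loss of a randomized choice is a weighted average of the expected losses of the pure actions, so it cannot fall below the smallest of them. The sketch below is not a proof. It borrows the travel insurance numbers of Section 7.3 for the posterior expected losses after a positive test, and the function names and the random sampling of mixtures are hypothetical devices for illustration.

```python
import random

# Posterior probability of illness after a positive test (Section 7.3 numbers).
P_ILL_POSITIVE = 0.03 * 0.90 / (0.03 * 0.90 + 0.97 * 0.23)

# Posterior expected losses of the two pure actions given x = 1.
PURE_LOSSES = [1000.0 * P_ILL_POSITIVE, 50.0]  # [no insurance, insurance]


def mixed_expected_loss(probs, pure_losses):
    """Expected loss of choosing action i with probability probs[i]."""
    return sum(q * rho for q, rho in zip(probs, pure_losses))


def random_mixture(n, rng):
    """A random probability vector of length n."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    best_pure = min(PURE_LOSSES)
    mixtures = [random_mixture(len(PURE_LOSSES), rng) for _ in range(5)]
    for probs in mixtures:
        print([round(q, 3) for q in probs],
              round(mixed_expected_loss(probs, PURE_LOSSES), 2),
              ">=", round(best_pure, 2))
```

Every sampled mixture does at least as badly as the best pure action, and equality requires putting all the probability on actions that attain the minimum.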

DeGroot comments on the use of randomized decision rules: 

This discussion supports the intuitive notion that the statistician should 
not base an important decision on the outcomes of the toss of a coin. 

When two or more pure decisions each yield the Bayes risk, an auxiliary
randomization can be used to select one of these Bayes decisions.
However, the randomization is irrelevant in this situation since any method of
selecting one of the Bayes decisions is acceptable. In any other situation,
when the statistician makes use of a randomized procedure, there is a
chance that the final decision may not be a Bayes decision.

Nevertheless, randomization has an important use in statistical work.

The concepts of selecting a random sample and of assigning different
treatments to experimental units at random are basic ones for the
performance of effective experimentation. These comments do not really
conflict with those in the preceding paragraph which indicate that
the statistician need never use randomized decisions (DeGroot 1970,
pp. 129-130)

The next three sections illustrate these concepts in common statistical decision 
problems: classification, hypothesis testing, point and interval estimation. 


7.5 Classification and hypothesis tests 

7.5.1 Hypothesis testing 

Contrasting their approach to Fisher’s significance testing, Neyman and Pearson 
write: 


no test based upon a theory of probability can by itself provide any valuable
evidence of the truth or falsehood of a hypothesis. But we may look
at the purpose of tests from another viewpoint. Without hoping to know 
whether each separate hypothesis is true or false, we may search for rules 
to govern our behaviour with regard to them, in following which we 
insure that, in the long run of experience, we shall not often be wrong. 
(Neyman and Pearson 1933, p. 291) 

This insight was one of the foundations of Wald’s work, so hypothesis tests are a
natural place to start visiting some examples. In statistical decision theory, hypothesis
testing is typically modeled as the choice between actions $a_0$ and $a_1$, where $a_i$ denotes
accepting hypothesis $H_i: \theta \in \Theta_i$, with $i$ either 0 or 1. Thus, $\mathcal{A} = \{a_0, a_1\}$ and
$\Theta = \Theta_0 \cup \Theta_1$. A discussion of what it really means to accept a hypothesis could take us
far astray, but is not a point to be taken lightly. An engaging reading on this subject is
the debate between Fisher and Neyman about the meaning of hypothesis tests, which
you can track down from Fienberg (1992).

If the hypothesis test is connected to a concrete problem, it may be possible to
specify a loss function that quantifies how to penalize the consequences of choosing
decision $a_0$ when $H_1$ is true, and decision $a_1$ when $H_0$ is true. In the simplest
formulation it only matters whether the correct hypothesis is accepted, and errors
are independent of the variation of the parameter $\theta$ within the two hypotheses. This
is realistic, for example, when the practical consequences of the decision depend
primarily on the direction of an effect. Formally,

\[ L(\theta, a_0) = L_0 I_{\{\theta \in \Theta_1\}} \]
\[ L(\theta, a_1) = L_1 I_{\{\theta \in \Theta_0\}}. \]

Here $L_0$ and $L_1$ are positive numbers and $I_A$ is the indicator of the event $A$. Table 7.11
restates the same assumption in a more familiar form.

Table 7.11 A simple loss function
for the hypothesis testing problem.

            $\theta \in \Theta_0$    $\theta \in \Theta_1$
$a_0$              0                      $L_0$
$a_1$            $L_1$                      0

As the decision space is binary, any decision rule $\delta$ must split the set of possible
experimental results into two sets: one associated with $a_0$ and the other with $a_1$. The
risk function is

\[
R(\theta, \delta) = \int_{\mathcal{X}} L(\theta, \delta) f(x|\theta)\,dx =
\begin{cases}
L_1 P(\delta(x) = a_1|\theta) = L_1 \alpha_\delta(\theta), & \text{if } \theta \in \Theta_0 \\
L_0 P(\delta(x) = a_0|\theta) = L_0 \beta_\delta(\theta), & \text{if } \theta \in \Theta_1.
\end{cases}
\]

From the point of view of finding optima, losses can be rescaled arbitrarily, so
any solution, minimax or Bayes, will only depend on $L_1$ and $L_0$ through their
ratio.

If the sets $\Theta_0$ and $\Theta_1$ are singletons (the familiar simple versus simple hypotheses
test) we can, similarly to Figures 7.1 and 7.2, represent any decision rule as a point
in the space $(R(\theta_0, \delta), R(\theta_1, \delta))$, called the risk space. The Bayes and minimax optima
can be derived using the same geometric intuition based on lines and wedges. If the
data are from a continuous distribution, the lower solid line of Figures 7.1 and 7.2
will often be replaced by a smooth convex curve, and the solutions will be unique.
With discrete data there may be a need to randomize to improve minimax solutions.

We will use the terms null hypothesis for $H_0$ and alternative hypothesis for
$H_1$ even though the values of $L_0$ and $L_1$ are the only real element of
differentiation here. In this terminology, $\alpha_\delta$ and $\beta_\delta$ can be recognized as the probabilities
of type I and II errors. In the classical Neyman-Pearson hypothesis testing
framework, rather than specifying $L_0$ and $L_1$ one sets the value for $\sup_{\theta \in \Theta_0} \alpha_\delta(\theta)$
and minimizes $\beta_\delta(\theta)$, $\theta \in \Theta_1$, with respect to $\delta$. Because $\beta_\delta(\theta)$ is one minus the
power, a uniformly most powerful decision rule will have to dominate all others
in the set $\Theta_1$. For a comprehensive discussion on classical hypothesis testing see
Lehmann (1997).

In the simple versus simple case, it is often possible to set $L_1/L_0$ so that the
minimax approach and Neyman-Pearson approach for a fixed $\alpha$ lead to the same
rule. The specification of $\alpha$ and that of $L_1/L_0$ can substitute for each other. In all
fields of science, the use of $\alpha = 0.05$ is often taken for granted, without any serious
thought being given to the implicit balance of losses. This has far-reaching negative
consequences in both science and policy.

Moving to the Bayesian approach, we begin by specifying a prior $\pi$ on $\theta$. Before
making any observation, the expected losses are


\[
E[L(\theta, a)] = \int_{\Theta} L(\theta, a)\,\pi(\theta)\,d\theta =
\begin{cases}
L_0 (1 - \pi(\theta \in \Theta_0)), & \text{if } a = a_0, \\
L_1 \pi(\theta \in \Theta_0), & \text{if } a = a_1.
\end{cases}
\]

Thus, the Bayes action is $a_0$ if

\[ \frac{L_0}{L_1} < \frac{\pi(\theta \in \Theta_0)}{1 - \pi(\theta \in \Theta_0)} \]

and otherwise it is $a_1$. In particular, if $L_0 = L_1$, the Bayes action is $a_0$ whenever
$\pi(\theta \in \Theta_0) > 1/2$. By a similar argument, after data are available, the function

\[
E[L(\theta, a)|x] =
\begin{cases}
L_0 (1 - \pi_x(\theta \in \Theta_0)), & \text{if } a = a_0, \\
L_1 \pi_x(\theta \in \Theta_0), & \text{if } a = a_1,
\end{cases}
\]

is the posterior expected loss and the formal Bayes rule is to choose $a_0$ if

\[ \frac{L_0}{L_1} < \frac{\pi_x(\theta \in \Theta_0)}{1 - \pi_x(\theta \in \Theta_0)} \]

and otherwise to choose $a_1$. The ratio of posterior probabilities on the right hand side
can be written as

\[ \frac{\pi_x(\theta \in \Theta_0)}{1 - \pi_x(\theta \in \Theta_0)} = \frac{\pi(\theta \in \Theta_0)\, f(x|\theta \in \Theta_0)}{(1 - \pi(\theta \in \Theta_0))\, f(x|\theta \in \Theta_1)}, \]

where

\[ f(x|\theta \in \Theta_i) = \int_{\Theta_i} f(x|\theta)\,\pi(\theta|\theta \in \Theta_i)\,d\theta \]

for $i = 0, 1$. The ratio of posterior probabilities can be further expanded as

\[ \frac{\pi_x}{1 - \pi_x} = \frac{\pi}{1 - \pi} \cdot \frac{f(x|\theta \in \Theta_0)}{f(x|\theta \in \Theta_1)}, \]

that is, a product of the prior odds ratio and the so-called Bayes factor
$BF = f(x|\theta \in \Theta_0)/f(x|\theta \in \Theta_1)$. The decision depends on the data only through the Bayes factor. In
the definition above we follow Jeffreys’ definition (Jeffreys 1961) and place the null
hypothesis in the numerator, but you may find variation in the literature. Bernardo
and Smith (1994) provide further details on the Bayes factors in decision theory.
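
The Bayes rule above is simple to apply in the simple versus simple case. The following is a minimal sketch under assumptions that are not in the text: a single observation $x$ from a normal distribution with known unit standard deviation, null value 0, alternative value 1, and illustrative prior probabilities and losses. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_factor(x, theta0, theta1, sd=1.0):
    """BF = f(x | theta0) / f(x | theta1), with the null in the numerator."""
    return normal_pdf(x, theta0, sd) / normal_pdf(x, theta1, sd)


def posterior_odds(x, prior_null, theta0, theta1, sd=1.0):
    """Prior odds of the null times the Bayes factor."""
    return prior_null / (1.0 - prior_null) * bayes_factor(x, theta0, theta1, sd)


def bayes_decision(x, prior_null, loss0, loss1, theta0=0.0, theta1=1.0, sd=1.0):
    """Choose a0 (accept H0) when L0 / L1 is below the posterior odds of H0."""
    odds = posterior_odds(x, prior_null, theta0, theta1, sd)
    return "a0" if loss0 / loss1 < odds else "a1"


if __name__ == "__main__":
    for x in (0.0, 0.5, 2.0):
        print(x, round(bayes_factor(x, 0.0, 1.0), 3),
              bayes_decision(x, prior_null=0.5, loss0=1.0, loss1=1.0))
    # Making a false rejection ten times as costly shifts the decision toward a0.
    print(bayes_decision(2.0, prior_null=0.5, loss0=1.0, loss1=10.0))
```

The data enter only through `bayes_factor`, while the prior odds and the loss ratio set the threshold, which mirrors the factorization of the posterior odds given above.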

The Bayes factor has also been proposed by Harold Jeffreys and others as a direct
and intuitive measure of evidence to be used as an alternative to, say, p-values for
quantifying the evidence against a hypothesis. See Kass and Raftery (1995) and Goodman
(1999) for recent discussions of this perspective. Jeffreys (1961, Appendix B)
proposed the rule of thumb in Table 7.12 to help with the interpretation of the Bayes
factor outside of a decision context. You can be the judge of whether this is fruitful;
in any case, the contrast between this approach and the explicit consideration of
consequences is striking.

Table 7.12 Interpretation of the Bayes factors for comparison between
two hypotheses, according to Jeffreys.

Grade    $2 \log_{10} BF$    Evidence
0        > 0                 Null hypothesis supported
1        -1/2 to 0           Evidence against $H_0$, but not worth more than a bare mention
2        -1 to -1/2          Evidence against $H_0$ substantial
3        -3/2 to -1          Evidence against $H_0$ strong
4        -2 to -3/2          Evidence against $H_0$ very strong
5        < -2                Evidence against $H_0$ decisive


The Bayes decision rule presented here is broadly applicable beyond a binary 
partition of the parameter space. For example, it extends easily to nuisance
parameters, and can be applied to any pair of hypotheses selected from a discrete set, as
we will see in the context of model choice in Section 11.1.2. One case that requires 
a separate and more complicated discussion is the comparison of a point null with a 
composite alternative (see Problem 7.3). 


7.5.2 Multiple hypothesis testing 

A trickier breed of decision problems appears when we wish to jointly test a battery
of related hypotheses all at once. For example, in a clinical trial with four
treatments, there are six pairwise comparisons we may be interested in. At the opposite
extreme, a genome-wide scan for genetic variants associated with a disease may
give us the opportunity to test millions of associations between genetic variables
and a disease of interest (Hirschhorn and Daly 2005). In this section we illustrate a
Bayesian decision-theoretic approach based on Muller et al. (2005). We will not
venture here into the frequentist side of multiple testing. A good entry point is Hochberg
and Tamhane (1987). Some considerations contrasting the two approaches are in
Berry and Hochberg (1999), Genovese and Wasserman (2003), and Muller et al.
(2007b).

The setting is this. We are interested in $I$ null hypotheses $\theta_i = 0$, with $i = 1, \ldots, I$,
to be compared against the corresponding alternatives $\theta_i = 1$. Available decisions
for each $i$ are a rejection of the null hypothesis ($a_i = 1$), or not ($a_i = 0$). In
massive comparison problems, rejections are sometimes called discoveries, or selections.
$I$-dimensional vectors $\theta$ and $a$ line up all the hypotheses and corresponding decisions.
The set of indexes such that $a_i = 1$ is a list of discoveries. In many applications, list
making is a good way to think about the essence of the problem. To guide us in the
selection of a list we observe data $x$ with distribution $f(x|\theta, \eta)$, where $\eta$ gathers any
remaining model parameters. A key quantity is $\pi_x(\theta_i = 1)$, the marginal posterior
probability that the $i$th null hypothesis is false. The nuisance parameters $\eta$ can be
removed by marginalization at the start, as we saw in Section 7.2.4.

The choice of a specific loss function is complicated by the fact that the
experiment involves two competing goals, discovering as many as possible of the
components that have $\theta_i = 1$, while at the same time controlling the number of false
discoveries. We discuss two alternative utility functions that combine the two goals.
These capture, at least as a first approximation, the goals of massive multiple
comparisons, are easy to evaluate, lead to simple decision rules, and can be interpreted
as generalizations of frequentist error rates. Interestingly, all will lead to terminal
decision rules of the same form. Other loss functions for multiple comparisons are
discussed in the seminal work of Duncan (1965).

We start with the notation for the summaries that formalize the two competing
goals of controlling false negative and false positive decisions. The realized counts
of false discoveries and false negatives are

$$
\mathrm{FD}(a,\theta) = \sum_i a_i (1 - \theta_i)
$$

$$
\mathrm{FN}(a,\theta) = \sum_i (1 - a_i)\,\theta_i .
$$

Writing $D = \sum_i a_i$ for the number of discoveries, the realized percentages of wrong
decisions in each of the two lists, or false discovery rate and false negative rate, are,
respectively,

$$
\mathrm{FDR}(a,\theta) = \frac{\mathrm{FD}(a,\theta)}{D + \epsilon}, \qquad
\mathrm{FNR}(a,\theta) = \frac{\mathrm{FN}(a,\theta)}{I - D + \epsilon}.
$$

$\mathrm{FD}(\cdot)$, $\mathrm{FN}(\cdot)$, $\mathrm{FDR}(\cdot)$, and $\mathrm{FNR}(\cdot)$ are all
unknown. The additional term $\epsilon$ avoids a zero denominator.

We consider two ways of combining the goals of minimizing false discoveries
and false negatives. The first two specifications combine false negative and false
discovery rates and numbers, leading to the following loss functions:

$$
L_N(a,\theta) = k\,\mathrm{FD} + \mathrm{FN}
$$

$$
L_R(a,\theta) = k\,\mathrm{FDR} + \mathrm{FNR}.
$$

The loss function $L_N$ is a natural extension of the loss function of Table 7.11 with
$k = L_1/L_0$. From this perspective the combination of error rates in $L_R$ seems less
attractive, because the losses for a false discovery and a false negative depend on the
total number of discoveries or negatives, respectively.

Alternatively, we can model the trade-offs between false negatives and false
positives as a multicriteria decision. For the purpose of this discussion you can think of
multicriteria decisions as decisions in which the loss function is multidimensional.
We have not seen any axiomatic foundation for this approach, which you can learn
about in Keeney et al. (1976). However, the standard approach to selecting an action
in multicriteria decision problems is to minimize one dimension of the expected loss
while enforcing a constraint on the remaining dimensions. The Neyman-Pearson
approach to maximizing power subject to a fixed type I error, seen in Section 7.5.1,
is an example. We will call $L_{2N}$ and $L_{2R}$ the multicriteria counterparts of $L_N$
and $L_R$.

Conditioning on $x$ and marginalizing with respect to $\theta$, we obtain the posterior
expected FD and FN

$$
\mathrm{FD}_x(a) = \sum_i a_i \bigl(1 - \pi_x(\theta_i = 1)\bigr)
$$

$$
\mathrm{FN}_x(a) = \sum_i (1 - a_i)\,\pi_x(\theta_i = 1)
$$

and the corresponding $\mathrm{FDR}_x(a) = \mathrm{FD}_x(a)/(D + \epsilon)$ and
$\mathrm{FNR}_x(a) = \mathrm{FN}_x(a)/(I - D + \epsilon)$. See also Genovese and Wasserman (2002).
Using these quantities we can compute posterior expected losses for both loss
formulations, and also define the optimal decisions under $L_{2N}$ as minimization of FN
subject to $\mathrm{FD} \le \alpha_N$. Similarly, under $L_{2R}$ we minimize FNR subject to
$\mathrm{FDR} \le \alpha_R$.

Under all four loss functions the optimal decision about the multiple comparison
is to select the dimensions that have a sufficiently high posterior probability
$\pi(\theta_i = 1|x)$, using the same threshold for all dimensions:

Theorem 7.3 Under all four loss functions the optimal decision takes the form
$a(t^*)$ defined as $a_i = 1$ if and only if

$$
\pi(\theta_i = 1|x) > t^* .
$$

The optimal choices of $t^*$ are

$$
\begin{aligned}
t^*_N &= k/(k+1) && \text{for } L_N \\
t^*_R(x) &= v_{(I - D^*)} && \text{for } L_R \\
t^*_{2N}(x) &= \min\{s : \mathrm{FD}_x(a(s)) \le \alpha_N\} && \text{for } L_{2N} \\
t^*_{2R}(x) &= \min\{s : \mathrm{FDR}_x(a(s)) \le \alpha_R\} && \text{for } L_{2R}.
\end{aligned}
\tag{7.18}
$$

In the expression for $t^*_R$, $v_{(i)}$ is the $i$th order statistic of the vector
$\{\pi(\theta_1 = 1|x), \ldots, \pi(\theta_I = 1|x)\}$, and $D^*$ is the optimal number of
discoveries.

The proof is in the appendix of Muller et al. (2005).

Under $L_R$, $L_{2N}$, and $L_{2R}$ the optimal threshold $t^*$ depends on the observed data.
The nature of the terminal decision rule $a_i$ is the same as in Genovese and Wasserman
(2002), who discuss a more general rule, allowing the decision to be determined by
cutoffs on any univariate summary statistic.

A very early incarnation of this result is in Pitman (1965) who considers, from a
frequentist standpoint, the case where one wishes to test $I$ simple hypotheses versus
the corresponding simple alternatives. The experiments and the hypotheses have no
connection with each other. If the goal is to maximize the average power, subject to
a constraint on the average type I error, the optimal solution is a likelihood ratio test
for each hypothesis, and the same cutoff point is used in each.
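
To make the thresholding rules concrete, here is a minimal Python sketch, not taken
from Muller et al. (2005). It assumes that the marginal posterior probabilities
$\pi(\theta_i = 1|x)$ have already been computed and are passed in as a plain list. The
function names, the example probabilities, and the small default for $\epsilon$ are our own
illustrative choices. Under $L_N$ the list collects the hypotheses whose posterior
probability exceeds $k/(k+1)$; under $L_{2R}$ we rank the hypotheses and keep the longest
top-ranked list whose posterior expected FDR does not exceed the bound.

```python
def discoveries_LN(post_probs, k):
    """Indices selected under L_N: posterior probability above k / (k + 1)."""
    threshold = k / (k + 1)
    return [i for i, v in enumerate(post_probs) if v > threshold]


def discoveries_L2R(post_probs, alpha, eps=1e-9):
    """Longest top-ranked list whose posterior expected FDR is at most alpha."""
    order = sorted(range(len(post_probs)),
                   key=lambda i: post_probs[i], reverse=True)
    best, expected_fd = [], 0.0
    for size, i in enumerate(order, start=1):
        expected_fd += 1.0 - post_probs[i]       # running FD_x of the list
        if expected_fd / (size + eps) <= alpha:  # FDR_x of the top-`size` list
            best = order[:size]
    return sorted(best)


# Hypothetical posterior probabilities for six hypotheses.
v = [0.99, 0.20, 0.95, 0.60, 0.90, 0.05]
print(discoveries_LN(v, k=1))          # threshold 1/2
print(discoveries_LN(v, k=3))          # threshold 3/4
print(discoveries_L2R(v, alpha=0.10))  # bound on posterior expected FDR
```

Raising $k$ makes false discoveries costlier and shortens the list, while the $L_{2R}$ list
grows with the bound $\alpha_R$.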


7.5.3 Classification 

Binary classification problems (Duda et al. 2000) consider assigning cases to one of
two classes, based on measuring a series of attributes of the cases. An example is
medical diagnosis. In the simplest formulation, the action space has two elements:
“diagnosis of disease” ($a_1$) and “diagnosis of no disease” ($a_0$). The states of “nature”
(the patient, really) are disease or no disease. The loss function, shown in Table 7.13,
is the same we used for hypothesis testing, with $L_0$ representing the loss of diagnosing
a diseased patient as healthy, and $L_1$ representing the loss of diagnosing a healthy
patient as diseased.

The main difference here is that we observe data on the attributes $x$ and correct
disease classification $y$ of a sample of individuals, and wish to classify an additional
individual, randomly drawn from the same population. The model specifies
$f(y_i|x_i, \theta)$ for individual $i$. If $\tilde{x}$ are the observed features of the new
individual to be classified, and $\tilde{y}$ is the unknown disease state, the ingredients
for computing the posterior predictive probabilities are

$$
\pi(\tilde{y} = 0 \mid y, x, \tilde{x}) = M \int_\Theta f(\tilde{y} = 0 \mid \tilde{x}, \theta)\, \pi(\theta)\, f(y, x \mid \theta)\, d\theta
$$

$$
\pi(\tilde{y} = 1 \mid y, x, \tilde{x}) = M \int_\Theta f(\tilde{y} = 1 \mid \tilde{x}, \theta)\, \pi(\theta)\, f(y, x \mid \theta)\, d\theta
$$

where $1/M$ is the marginal probability of $(x, y)$. From steps similar to Section 7.5.1,
the Bayes rule is to choose a diagnosis of no disease if

$$
\frac{\pi(\tilde{y} = 0 \mid y, x, \tilde{x})}{\pi(\tilde{y} = 1 \mid y, x, \tilde{x})} > \frac{L_0}{L_1}.
$$

This classification rule incorporates inference on the population model, uncertainty
about population parameters, and relative disutilities of misclassification errors.
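
As a small illustration of the last step only, the following Python sketch applies the
rule once the posterior predictive probability of no disease is available. Computing
that probability requires the model and the integrals above, which the sketch takes as
given. The function name `bayes_diagnosis` and the numerical inputs are hypothetical.

```python
def bayes_diagnosis(p_no_disease, loss_fn, loss_fp):
    """Return 0 (diagnose no disease) or 1 (diagnose disease) for a new case.

    p_no_disease: posterior predictive probability that the new case is healthy.
    loss_fn: L_0, the loss of diagnosing a diseased patient as healthy.
    loss_fp: L_1, the loss of diagnosing a healthy patient as diseased.
    """
    p_disease = 1.0 - p_no_disease
    expected_loss_a0 = loss_fn * p_disease     # we may miss a disease
    expected_loss_a1 = loss_fp * p_no_disease  # we may raise a false alarm
    # Equivalent to the odds rule: choose a_0 when the odds exceed L_0 / L_1.
    return 0 if expected_loss_a0 < expected_loss_a1 else 1


print(bayes_diagnosis(0.80, loss_fn=1, loss_fp=1))   # equal losses
print(bayes_diagnosis(0.80, loss_fn=10, loss_fp=1))  # missed disease is costly
```

With equal losses an 80% chance of no disease leads to $a_0$; when a missed disease is
ten times as costly, the same probability leads to $a_1$.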


Table 7.13 A loss function for the binary classification problem.

            No disease (y = 0)    Disease (y = 1)
$a_0$       0                     $L_0$
$a_1$       $L_1$                 0


7.6 Estimation


7.6.1 Point estimation 


Statistical point estimation assigns a single best guess to an unknown parameter.
This assignment can be viewed as a decision problem in which $\mathcal{A} = \Theta$. Decision
functions map data into point estimates, and they are also called estimators. Even
though point estimation is becoming a bit outdated because of the ease of looking at
entire distributions of unknowns, it is an interesting simplified setting for examining
the implications of various decision principles.

In this setting, the loss function measures the error from declaring that the
estimate is $a$ when the correct value is $\theta$. Suppose that $\mathcal{A} = \Theta = \Re$ and that
the loss function is quadratic, $L(\theta,a) = (\theta - a)^2$. This loss has been a long-time
favorite because it leads to easily interpretable analytic results. In fact its use goes
back at least as far as Gauss. With quadratic loss, the risk function can be broken down
into two pieces as


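$$
\begin{aligned}
R(\theta,\delta) &= \int_{\mathcal{X}} L(\theta,\delta)\, f(x|\theta)\, dx \\
&= \int_{\mathcal{X}} (\delta(x) - \theta)^2 f(x|\theta)\, dx \\
&= \int_{\mathcal{X}} \bigl[(\delta(x) - E[\delta(x)|\theta]) + (E[\delta(x)|\theta] - \theta)\bigr]^2 f(x|\theta)\, dx \\
&= \mathrm{Var}[\delta(x)|\theta] + \bigl[E[\delta(x)|\theta] - \theta\bigr]^2 .
\end{aligned}
\tag{7.19}
$$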


The first term in the decomposition is the variance of the decision rule and the second 
term is its bias, squared. 

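Another interesting fact is that the Bayes rule is simply $\delta^*(x) = E[\theta|x]$, that is
the posterior mean. This is because the posterior expected loss is

$$
\begin{aligned}
\mathcal{L}_x(a) &= \int_{\Re} (\theta - a)^2 \pi_x(\theta)\, d\theta \\
&= \int_{\Re} \bigl[(\theta - E[\theta|x]) + (E[\theta|x] - a)\bigr]^2 \pi_x(\theta)\, d\theta \\
&= \int_{\Re} (\theta - E[\theta|x])^2 \pi_x(\theta)\, d\theta + (a - E[\theta|x])^2 \\
&= \mathrm{Var}[\theta|x] + (a - E[\theta|x])^2 ,
\end{aligned}
$$

which is minimized by taking $a^* = E[\theta|x]$. The cross term vanishes because
$\theta - E[\theta|x]$ has posterior mean zero. Thus, the posterior expected loss associated
with $a^*$ is the posterior variance of $\theta$. If instead $L(\theta,a) = |\theta - a|$, then
$\delta^*(x)$ is the median of $\pi(\theta|x)$.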
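
A quick numerical check of these two facts can be reassuring. The Python sketch below
uses a made-up discrete posterior (the support points and probabilities are purely
illustrative) and searches a grid of candidate actions for the minimizer of the
posterior expected loss under squared error and under absolute error.

```python
def expected_loss(a, support, probs, loss):
    """Posterior expected loss of action a for a discrete posterior."""
    return sum(p * loss(theta, a) for theta, p in zip(support, probs))


# Hypothetical skewed posterior: mean 2.6, median 2.
support = [0, 1, 2, 3, 10]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
candidates = [i / 10 for i in range(0, 101)]  # actions 0.0, 0.1, ..., 10.0

best_sq = min(candidates, key=lambda a: expected_loss(
    a, support, probs, lambda t, act: (t - act) ** 2))
best_abs = min(candidates, key=lambda a: expected_loss(
    a, support, probs, lambda t, act: abs(t - act)))
posterior_mean = sum(t * p for t, p in zip(support, probs))

print(best_sq, posterior_mean)  # squared error: minimizer sits at the mean
print(best_abs)                 # absolute error: minimizer sits at the median
```

Because the posterior is skewed to the right, the two losses lead to visibly different
estimates.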

A widely accepted estimation paradigm is that of maximum likelihood (Fisher
1925). The decision-theoretic framework is useful in bringing out the implicit value
system of the maximum likelihood approach. To illustrate, take a discrete parameter
space $\Theta = \{\theta_1, \theta_2, \ldots\}$ and imagine that the estimation problem is such that
we gain something only if our estimate is exactly right. The corresponding loss function is

$$
L(\theta,a) = I_{\{a \neq \theta\}},
$$

where, again, $a$ represents a point estimate of $\theta$. The posterior expected loss is
minimized by $a^* = \mathrm{mode}(\theta|x) = \theta^0$ (use your favorite tie-breaking rule if
there is more than one mode). If the prior is uniform on $\Theta$ (and in other cases as
well), the mode of the posterior distribution will coincide with the value of $\theta$ that
maximizes the likelihood.

Extending this correspondence to the continuous case is more challenging,
because the posterior probability of getting the answer exactly right is zero. If we
set

$$
L(\theta,a) = I_{\{|a - \theta| \geq \epsilon\}},
$$

then the Bayes rule is the value of $a$ that has the largest posterior mass in a
neighborhood of size $\epsilon$. If the posterior is a density, and the prior is approximately
flat, then this $a$ will be close to the value maximizing the likelihood.

In the remainder of this section we present two simple illustrations. 

Example 7.2 This example goes back to the “secret number” example of
Section 1.1. You have to guess a secret number. You know it is an integer. You can
perform an experiment that would yield either the number before it or the number
after it, with equal probability. You know there is no ambiguity about the
experimental result or about the experimental answer. You can perform the experiment
twice. More formally, $x_1$ and $x_2$ are independent observations from

$$
f(x = \theta - 1 \mid \theta) = f(x = \theta + 1 \mid \theta) = \frac{1}{2},
$$

where $\Theta$ are the integers. We are interested in estimating $\theta$ using the loss function
$L(\theta,a) = I_{\{a \neq \theta\}}$. The estimator

$$
\delta_0(x_1, x_2) = \frac{x_1 + x_2}{2}
$$

is equal to $\theta$ if and only if $x_1 \neq x_2$, which happens in one-half of the samples.
Therefore $R(\theta, \delta_0) = 1/2$. Also the estimator

$$
\delta_1(x_1, x_2) = x_1 + 1
$$

is equal to $\theta$ if $x_1 < \theta$, which also happens in one-half of the samples. So
$R(\theta, \delta_1) = 1/2$. These two estimators are indistinguishable from the point of
view of frequentist risk.

What is the Bayes strategy? If $x_1 \neq x_2$ then

$$
\pi\!\left(\theta = \frac{x_1 + x_2}{2} \,\Big|\, x_1, x_2\right) = 1
$$

and the optimal action is $a^* = (x_1 + x_2)/2$. If $x_1 = x_2$, then

$$
\pi(\theta = x_1 + 1 \mid x_1, x_2) = \frac{\pi(x_1 + 1)}{\pi(x_1 + 1) + \pi(x_1 - 1)}
$$

and similarly,

$$
\pi(\theta = x_1 - 1 \mid x_1, x_2) = \frac{\pi(x_1 - 1)}{\pi(x_1 + 1) + \pi(x_1 - 1)}
$$

so that the optimal action is $x_1 + 1$ if $\pi(x_1 + 1) > \pi(x_1 - 1)$ and $x_1 - 1$ if
$\pi(x_1 + 1) < \pi(x_1 - 1)$. If the prior is such that it is approximately Bayes to choose
$x_1 + 1$ if $x_1 = x_2$, then the resulting Bayes rule has frequentist risk of $1/4$. ★
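
The three risks in this example can be verified by brute force, since for any $\theta$ there
are only four equally likely samples. The Python sketch below enumerates them. The
function names are ours, and `delta_bayes` encodes the specific rule discussed above
(midpoint when the observations differ, $x_1 + 1$ otherwise), which is Bayes only under
the prior assumption stated in the text.

```python
from fractions import Fraction
from itertools import product


def risk(rule, theta=0):
    """Exact probability that rule(x1, x2) misses theta (0-1 loss)."""
    samples = list(product((theta - 1, theta + 1), repeat=2))  # equally likely
    misses = sum(rule(x1, x2) != theta for x1, x2 in samples)
    return Fraction(misses, len(samples))


def delta_0(x1, x2):
    return (x1 + x2) // 2  # midpoint; x1 + x2 is always even here


def delta_1(x1, x2):
    return x1 + 1


def delta_bayes(x1, x2):
    # Midpoint when the observations differ, otherwise guess x1 + 1.
    return (x1 + x2) // 2 if x1 != x2 else x1 + 1


for rule in (delta_0, delta_1, delta_bayes):
    print(rule.__name__, risk(rule, theta=7))
```

The risks do not depend on the value of $\theta$ passed to `risk`, in agreement with the
calculation in the example.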


Example 7.3 Suppose that $x$ is drawn from a $N(\theta, \sigma^2)$ distribution with $\sigma^2$
known. Let $L(\theta, a) = (\theta - a)^2$ be the loss function. We want to study properties of
rules of the form $\delta(x) = cx$, for $c$ a real constant. The risk function is

$$
\begin{aligned}
R(\theta, \delta) &= \mathrm{Var}[cx|\theta] + \bigl[E[cx|\theta] - \theta\bigr]^2 \\
&= c^2\sigma^2 + (c - 1)^2\theta^2 .
\end{aligned}
$$

First off we can rule out a whole bunch of rules in the family. For example, as
illustrated in Figure 7.5, all the $\delta$ with $c > 1$ are dominated by the $c = 1$ rule,
because $R(\theta,x) < R(\theta,cx)$, $\forall\, \theta$. So it would be foolish to choose $c > 1$.
In Chapter 8 we will introduce technical terminology for this kind of foolishness. The
rule $\delta(x) = x$ is a bit special. It is unbiased and, unlike the others, has the same risk
no matter what $\theta$ is. This makes it the unique minimax rule in the class, because all
others have a maximum risk of infinity. Rules with $0 < c < 1$, however, are not dominated
by $\delta(x) = x$, because they have a smaller variance, which may offset the higher bias
when $\theta$ is indeed small.

Figure 7.5 Risk functions for $c = 1/2$, $1$, and $2$. The $c = 2$ rule is dominated by the
$c = 1$ rule, while no dominance relation exists between the $c = 1/2$ and $c = 1$ rules.

One way of understanding why one may want to use a $c$ that is not 1 is to bring in
prior information. For example, say $\pi(\theta) \sim N(\mu_0, \tau_0^2)$, where $\mu_0$ and
$\tau_0^2$ are known. After $x$ is observed, the posterior expected loss is the posterior
variance and the Bayes action is the posterior mean. Thus, using the results from
Chapter 16,

$$
\delta^*(x) = \frac{\sigma^2\mu_0 + \tau_0^2 x}{\sigma^2 + \tau_0^2}.
$$

When $\mu_0 = 0$ the decision rule belongs to the family we are studying for
$c = \tau_0^2/(\sigma^2 + \tau_0^2) < 1$. To see what happens when $\mu_0 \neq 0$ study the class
$\delta(x) = c_0 + cx$. ★
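
The dominance statements in Example 7.3 are easy to check numerically from the risk
formula $c^2\sigma^2 + (c-1)^2\theta^2$. The Python sketch below does so on a grid of values
of $\theta$. The grid and the choice $\sigma^2 = 1$ are arbitrary and only for illustration.

```python
def risk(theta, c, sigma2=1.0):
    """Frequentist risk of delta(x) = c * x under squared error loss."""
    return c**2 * sigma2 + (c - 1) ** 2 * theta**2


thetas = [t / 10 for t in range(-50, 51)]  # theta from -5 to 5

# c = 2 is dominated by c = 1: its risk is larger at every theta on the grid.
print(all(risk(t, 1) < risk(t, 2) for t in thetas))

# c = 1/2 versus c = 1: neither dominates the other.
print(risk(0.0, 0.5) < risk(0.0, 1))  # shrinking helps when theta is near 0
print(risk(3.0, 0.5) > risk(3.0, 1))  # and hurts when theta is far from 0
```

The second pair of comparisons is the numerical counterpart of the crossing risk
curves in Figure 7.5.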


7.6.2 Interval inference 

A common practice in statistical analysis is to report interval estimates, to
communicate succinctly both the likely magnitude of an unknown and the uncertainty
remaining after analysis. Interval estimation can also be framed as a decision
problem; in fact Wald was doing so as early as 1939. Rice et al. (2008) review
decision-theoretic approaches to interval estimation and Schervish (1995) provides a
thorough treatment. Rice et al. (2008) observe that loss functions for intervals should
trade off two competing goals: intervals should be small and close to the true value.
One of the illustrations they provide is this. If $\Theta$ is the parameter space, we need to
choose an interval $a = (a_1, a_2)$ within that space. A loss function capturing the two
competing goals of size and closeness is a weighted combination of the half distance
between the points and the “miss-distance,” that is zero if the parameter is in the
interval, and is the distance between the parameter and the nearest extreme if the
parameter is outside the interval:

$$
L(\theta,a) = L_1 \frac{a_2 - a_1}{2} + L_2 \bigl[(a_1 - \theta)_+ + (\theta - a_2)_+\bigr]. \tag{7.20}
$$

Here the subscript $+$ indicates the positive part of the corresponding function:
$g_+$ is $g$ when $g$ is positive and 0 otherwise. If this loss function is used, then the
optimal interval is to choose $a_1$ and $a_2$ to be the $L_1/(2L_2)$ and $1 - L_1/(2L_2)$
quantiles of the posterior distribution of $\theta$. This provides a decision-theoretic
justification for the common practice of computing equal tail posterior probability
regions. The tail probability is controlled by the parameters representing the relative
importance of size versus closeness. Analogously to the hypothesis test case, the same
result can be achieved by specifying $L_1/L_2$ or the probability $\alpha$ assigned to the two
tails in total.
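
The quantile solution can be checked numerically in a case where the posterior expected
loss has a closed form. The Python sketch below assumes, purely for illustration, a
normal posterior with mean 2 and standard deviation 1.5 and weights $L_1 = 0.1$,
$L_2 = 1$, so each tail receives probability 0.05. It relies on the standard identities
$E[(z - Z)_+] = z\Phi(z) + \phi(z)$ and $E[(Z - w)_+] = \phi(w) - w(1 - \Phi(w))$ for a standard
normal $Z$. The function names are ours.

```python
from statistics import NormalDist


def expected_interval_loss(a1, a2, posterior, L1, L2):
    """Posterior expected value of loss (7.20) for a normal posterior."""
    std = NormalDist()
    m, s = posterior.mean, posterior.stdev
    z, w = (a1 - m) / s, (a2 - m) / s
    miss_below = s * (z * std.cdf(z) + std.pdf(z))        # E[(a1 - theta)+]
    miss_above = s * (std.pdf(w) - w * (1 - std.cdf(w)))  # E[(theta - a2)+]
    return L1 * (a2 - a1) / 2 + L2 * (miss_below + miss_above)


def bayes_interval(posterior, L1, L2):
    """Equal-tail interval with probability L1 / (2 * L2) in each tail."""
    tail = L1 / (2 * L2)
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


posterior = NormalDist(mu=2.0, sigma=1.5)  # hypothetical posterior
L1, L2 = 0.1, 1.0
a1, a2 = bayes_interval(posterior, L1, L2)
best = expected_interval_loss(a1, a2, posterior, L1, L2)
print((a1, a2), best)

# Moving either endpoint away from the quantiles increases the expected loss.
for shift in (-0.2, 0.2):
    print(expected_interval_loss(a1 + shift, a2, posterior, L1, L2) > best,
          expected_interval_loss(a1, a2 + shift, posterior, L1, L2) > best)
```

The rule requires $L_1 < L_2$, so that the tail probability $L_1/(2L_2)$ is below one-half.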

An alternative specification of the loss function leads to a solution based on
moments. This loss is a weighted combination of the half distance between the
points and the distance of the true parameter from the center of the interval. This
is measured as squared error, standardized to the interval’s half size:

$$
L(\theta,a) = L_1 \frac{a_2 - a_1}{2} + L_2 \frac{\bigl(\theta - \frac{a_1 + a_2}{2}\bigr)^2}{(a_2 - a_1)/2}.
$$

The Bayes interval in this case is

$$
E[\theta|x] \pm \sqrt{L_2/L_1}\, \sqrt{\mathrm{Var}[\theta|x]}.
$$
Finally, Carlin and Louis (2008) consider the loss function

$$
L(\theta,a) = I_{\{\theta \notin a\}} + c \times \mathrm{volume}(a).
$$

Here $c$ controls the trade-off between the volume of $a$ and the posterior coverage
probability. Under this loss function, the Bayes rule is the subset of $\Theta$ including
values with highest posterior density, or HPD region(s).

7.7 Minimax-Bayes connections 

When are the Bayes rules minimax? This question has been studied intensely, partly 
from a game-theoretic perspective in which the prior is nature’s randomized decision 
rule, and the Bayes venue allows a minimax solution to be found. From a statistical 
perspective, the bottom line is that often one can concoct a prior distribution that is 
sufficiently pessimistic that the Bayes solution ends up being minimax. Conceptually 
this is important, because it establishes intersection between the set of all possible 
Bayes rules, each with its own prior, and minimax rules (which are often not unique), 
despite the different premises of the two approaches. Technically, matters get tricky 
very quickly. Our discussion is a quick tour. Much more thorough accounts are given 
by Schervish (1995) and Robert (1994). 

The first result establishes that if a rule $\delta$ has a risk $R$ that can be bounded by the
average risk $r$ of a Bayes rule, then it is minimax. This also applies to limits of $r$ over
sequences of priors.

Theorem 7.4 Let $\delta^*_k$ be the Bayes rule with respect to $\pi_k$. Suppose that
$\lim_{k \to \infty} r(\pi_k, \delta^*_k) = c < \infty$. If there is $\delta^M$ such that
$R(\theta, \delta^M) \le c$ for all $\theta$, then $\delta^M$ is minimax.


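Proof: Suppose by contradiction that $\delta^M$ is not minimax. Then, there is $\delta'$ and
$\epsilon > 0$ such that

$$
R(\theta, \delta') \le \sup_{\theta'} R(\theta', \delta^M) - \epsilon \le c - \epsilon, \quad \forall\, \theta.
$$

Take $k^*$ such that

$$
r(\pi_k, \delta^*_k) \ge c - \epsilon/2, \quad \forall\, k \ge k^*.
$$

Then, for $k \ge k^*$,

$$
\begin{aligned}
r(\pi_k, \delta') &= \int_\Theta R(\theta,\delta')\,\pi_k(\theta)\, d\theta \le (c - \epsilon) \int_\Theta \pi_k(\theta)\, d\theta \\
&= c - \epsilon < c - \epsilon/2 \le r(\pi_k, \delta^*_k),
\end{aligned}
$$

but this contradicts the hypothesis that $\delta^*_k$ is a Bayes rule with respect to $\pi_k$. □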


Example 7.4 Suppose that $x_1, \ldots, x_p$ are independent and distributed as
$N(\theta_i, 1)$. Let $x = (x_1, \ldots, x_p)'$ and $\theta = (\theta_1, \ldots, \theta_p)'$. We are
interested in showing that the decision rule $\delta^M(x) = x$ is minimax if one uses an
additive quadratic loss function

$$
L(\theta,a) = \sum_{i=1}^{p} (\theta_i - a_i)^2 .
$$

The risk function of $\delta^M$ is

$$
\begin{aligned}
R(\theta,\delta^M) &= \int_{\mathcal{X}} L(\theta,\delta^M)\, f(x|\theta)\, dx \\
&= E_{x|\theta}\Bigl[\sum_{i=1}^{p} (\theta_i - x_i)^2\Bigr] \\
&= \sum_{i=1}^{p} \mathrm{Var}[x_i|\theta] = p .
\end{aligned}
$$

To build our sequence of priors, we set prior $\pi_k$ to be such that the $\theta_i$ are
independent normals with mean 0 and variance $k$. With this prior and quadratic loss, the
Bayes rule is given by the vector of posterior means, that is

$$
\delta^*_k(x) = \frac{k}{k+1}\, x .
$$

Note that the $\theta_i$, $i = 1, \ldots, p$, are independent a posteriori with variance
$k/(k + 1)$. Therefore, the Bayes risk of $\delta^*_k$ is

$$
r(\pi_k, \delta^*_k) = p\, \frac{k}{k+1}.
$$

Taking the limit,

$$
\lim_{k \to \infty} r(\pi_k, \delta^*_k) = p .
$$

The assumptions of Theorem 7.4 are all met and therefore $\delta^M$ is minimax. This
example will be developed in great detail in Chapter 9. ★
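
The Bayes risk used in Example 7.4 can be double-checked by averaging the frequentist
risk of the linear rule over the prior. The Python sketch below does this in exact
arithmetic. It uses the fact that, coordinate by coordinate, the rule $cx$ with
$c = k/(k+1)$ has risk $c^2 + (1-c)^2\theta_i^2$ and that $E[\theta_i^2] = k$ under $\pi_k$. The
function name and the restriction to integer $k$ are our own conveniences.

```python
from fractions import Fraction


def bayes_risk(p, k):
    """Bayes risk of delta*_k(x) = k / (k + 1) * x under the N(0, k) prior."""
    c = Fraction(k, k + 1)
    # Per coordinate: variance c^2 plus squared bias (1 - c)^2 * theta^2,
    # averaged over the prior, where E[theta^2] = k.
    return p * (c**2 + (1 - c) ** 2 * k)


p = 3
for k in (1, 10, 100, 10**6):
    print(k, bayes_risk(p, k), float(bayes_risk(p, k)))
```

The values stay below $p$ and approach it as $k$ grows, which is the limit required by
Theorem 7.4.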

We now give a more formal definition of what it means for a prior to be
pessimistic for the purpose of our discussion on minimax: it means that that prior implies
the greatest possible Bayes risk.

Definition 7.9 A prior distribution $\pi^M$ for $\theta$ is least favorable if

$$
\inf_{\delta} r(\pi^M, \delta) = \sup_{\pi} \inf_{\delta} r(\pi, \delta). \tag{7.21}
$$

$\pi^M$ is also called the maximin strategy for nature.

Theorem 7.5 Suppose that $\delta^*$ is a Bayes rule with respect to $\pi^M$ and such that

$$
r(\pi^M, \delta^*) = \int_\Theta R(\theta,\delta^*)\, \pi^M(\theta)\, d\theta = \sup_{\theta} R(\theta,\delta^*).
$$

Then

1. $\delta^*$ is a minimax rule.

2. If $\delta^*$ is the unique Bayes rule with respect to $\pi^M$, then $\delta^*$ is the unique
minimax rule.

3. $\pi^M$ is the least favorable prior.

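Proof: To prove 1 and 2 note that

$$
\begin{aligned}
\sup_{\theta} R(\theta,\delta^*) &= \int_\Theta R(\theta,\delta^*)\, \pi^M(\theta)\, d\theta \\
&\le \int_\Theta R(\theta,\delta)\, \pi^M(\theta)\, d\theta,
\end{aligned}
$$

where the inequality is for any other decision rule $\delta$, because $\delta^*$ is a Bayes rule
with respect to $\pi^M$. When $\delta^*$ is the unique Bayes rule a strict inequality holds.
Moreover,

$$
\int_\Theta R(\theta,\delta)\, \pi^M(\theta)\, d\theta \le \sup_{\theta} R(\theta,\delta).
$$

Thus,

$$
\sup_{\theta} R(\theta,\delta^*) \le \sup_{\theta} R(\theta,\delta), \quad \forall\, \delta \in \mathcal{D}
$$

and $\delta^*$ is a minimax rule. If $\delta^*$ is the unique Bayes rule, the strict inequality
noted above implies that $\delta^*$ is the unique minimax rule.

To prove 3, take any other Bayes rule $\delta$ with respect to a prior distribution $\pi$.
Then

$$
\begin{aligned}
r(\pi,\delta) &= \int_\Theta R(\theta,\delta)\, \pi(\theta)\, d\theta \\
&\le \int_\Theta R(\theta,\delta^*)\, \pi(\theta)\, d\theta \\
&\le \sup_{\theta} R(\theta, \delta^*) \\
&= r(\pi^M,\delta^*),
\end{aligned}
$$

that is $\pi^M$ is the least favorable prior. □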

Example 7.5 Take $x$ to be binomial with unknown $\theta$, assume a quadratic loss
function $L(\theta, a) = (\theta - a)^2$, and assume that $\theta$ has a priori a
Beta$(\alpha_0, \beta_0)$ distribution. Under quadratic loss, the Bayes rule, call it $\delta^*$, is
the posterior mean

$$
\delta^*(x) = \frac{\alpha_0 + x}{\alpha_0 + \beta_0 + n},
$$

and its risk is

$$
R(\theta,\delta^*) = \frac{1}{(\alpha_0 + \beta_0 + n)^2}
\Bigl\{ \theta^2\bigl[(\alpha_0 + \beta_0)^2 - n\bigr] + \theta\bigl[n - 2\alpha_0(\alpha_0 + \beta_0)\bigr] + \alpha_0^2 \Bigr\}.
$$

If $\alpha_0 = \beta_0 = \sqrt{n}/2$, we have

$$
\delta^M(x) = \frac{x + \sqrt{n}/2}{n + \sqrt{n}} = \frac{x}{n}\, \frac{\sqrt{n}}{1 + \sqrt{n}} + \frac{1}{2(1 + \sqrt{n})}. \tag{7.22}
$$

This rule has constant risk

$$
R(\theta,\delta^M) = \frac{1}{4 + 8\sqrt{n} + 4n} = \frac{1}{4(1 + \sqrt{n})^2}. \tag{7.23}
$$

Since the risk is constant, $R(\theta,\delta^M) = r(\pi^M,\delta^M)$ for all $\theta$, and $\pi^M$ is a
Beta$(\sqrt{n}/2, \sqrt{n}/2)$ distribution. By applying Theorem 7.5 we conclude that
$\delta^M$ is minimax and $\pi^M$ is least favorable.

The maximum likelihood estimator $\delta$, under quadratic loss, has risk $\theta(1 - \theta)/n$
which has a unique maximum at $\theta = 1/2$. Figure 7.6 compares the risk of the
maximum likelihood estimator $\delta$ to that of the minimax estimator $\delta^M$ for four choices
of $n$. For small values of $n$, the minimax estimator is better for most values of the
parameter. However, for larger values of $n$ the improvement achieved by the minimax
estimator is negligible and limited to a narrow range. ★
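
The constant-risk property in (7.23) can be confirmed directly by summing over the
binomial sample space. The Python sketch below computes exact risks for both
estimators. The sample size $n = 10$ and the values of $\theta$ are arbitrary choices for
illustration, and the function names are ours.

```python
from math import comb, sqrt


def risk(estimator, theta, n):
    """Exact squared-error risk of estimator(x, n) for x ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )


def mle(x, n):
    return x / n


def minimax(x, n):
    return (x + sqrt(n) / 2) / (n + sqrt(n))


n = 10
constant = 1 / (4 * (1 + sqrt(n)) ** 2)  # right-hand side of (7.23)
for theta in (0.05, 0.1, 0.5, 0.9):
    print(theta, risk(mle, theta, n), risk(minimax, theta, n), constant)
```

Near $\theta = 1/2$ the minimax rule has the smaller risk, while near the boundary the
maximum likelihood estimator does better, as in Figure 7.6.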


Figure 7.6 Risk functions for $n = 1, 10, 100, 1000$ as a function of $\theta$. In each panel,
the solid line shows the risk function for the maximum likelihood estimator $\delta$, while
the dotted line shows the flat risk function for the minimax estimator $\delta^M$.


Example 7.6 We have seen how to determine minimax rules for normal and
binomial data. Lehmann (1983) shows how to use these results to derive minimax
estimators for the means of arbitrary distributions, under a much more general
setting than the parametric assumptions considered so far. The general idea is captured
in the following lemma.

Lemma 2 Let $x$ be a random quantity with unknown distribution $F$, and let $g(F)$ be
a functional defined over a set $\mathcal{F}_1$ of distributions $F$. Suppose that $\delta^M$ is a
minimax estimator of $g(F)$ when $F$ is restricted to some subset $\mathcal{F}_0$ of
$\mathcal{F}_1$. Then if

$$
\sup_{F \in \mathcal{F}_0} R(F, \delta^M) = \sup_{F \in \mathcal{F}_1} R(F, \delta^M), \tag{7.24}
$$

then $\delta^M$ is minimax also when $F$ is permitted to vary over $\mathcal{F}_1$.

See Lehmann (1983) for details. We will look at two examples where the functional
$g$ is the mean of $F$. Let $x_1, \ldots, x_n$ be independent and identically distributed
observations from $F$, and such that $E[x_i|\theta] = \theta < \infty$, for $i = 1, \ldots, n$. We
consider the problem of estimating $\theta$ under quadratic loss $L(\theta,a) = (\theta - a)^2$.

For a real-valued observation, we impose the restriction of bounded variance of
$F$, that is

$$
\mathrm{Var}[x_i|\theta] = \sigma^2 < \infty. \tag{7.25}
$$

Distributions with bounded variance will constitute our class $\mathcal{F}_1$. Bounded
variance will guarantee that the maximum risk of reasonable estimators of $\theta$ is finite.
We choose $\mathcal{F}_0$ to be the family of normal distributions for which restriction (7.25)
also holds. Similarly to Example 7.4, assume that the prior distribution of $\theta$ is
$N(\mu_0, k)$. Under quadratic loss the Bayes estimator is the posterior mean and the Bayes
risk is the posterior variance. This gives

$$
\delta^*_k(x) = \frac{n\bar{x}/\sigma^2 + \mu_0/k}{n/\sigma^2 + 1/k}
$$

$$
r(\pi_k, \delta^*_k) = \frac{1}{n/\sigma^2 + 1/k}.
$$

Now focus on $\delta^M = \bar{x}$. Since $\lim_{k \to \infty} r(\pi_k, \delta^*_k) = \sigma^2/n$ and
$R(\theta,\delta^M) = \sigma^2/n$, by applying Theorem 7.4 we conclude that $\bar{x}$ is minimax
for $\mathcal{F}_0$. Using the lemma, $\delta^M$ is minimax for $\mathcal{F}_1$.

Now take $\mathcal{F}_1$ to be the class of distributions $F$ such that $F(1) - F(0) = 1$. These
distributions have bounded support, a condition which allows us to work with finite
risk without bounding the variance. Let $\mathcal{F}_0$ be the Bernoulli family of
distributions. These are such that $f(x_i = 1|\theta) = \theta$ and $f(x_i = 0|\theta) = 1 - \theta$,
where $0 < \theta < 1$. Let us consider the estimator

$$
\delta^M(x) = \frac{\sqrt{n}}{1 + \sqrt{n}}\, \bar{x} + \frac{1}{2(1 + \sqrt{n})}.
$$

As we saw in Example 7.5, $\delta^M$ is the minimax estimator of $\theta$ as $F$ varies in
$\mathcal{F}_0$. To prove that $\delta^M$ is minimax with respect to $\mathcal{F}_1$, by virtue of the
lemma, it is enough to show that the risk $R(\theta,\delta^M)$ takes on its maximum over
$\mathcal{F}_0$. Observe that the risk is

$$
R(F, \delta^M) = E\left[\left(\frac{\sqrt{n}}{1 + \sqrt{n}}\, \bar{x} + \frac{1}{2(1 + \sqrt{n})} - \theta\right)^2 \,\Big|\, \theta\right].
$$

By adding and subtracting $\sqrt{n}/(1 + \sqrt{n})\,\theta$ we obtain

$$
R(F, \delta^M) = \frac{1}{(1 + \sqrt{n})^2} \left[\mathrm{Var}[x|\theta] + \left(\frac{1}{2} - \theta\right)^2\right].
$$

Observe that $0 \le x \le 1$. Then $x^2 \le x$ and
$\mathrm{Var}[x|\theta] = E[x^2|\theta] - \theta^2 \le E[x|\theta] - \theta^2 = \theta - \theta^2$.
Therefore

$$
R(F, \delta^M) \le \frac{1}{4(1 + \sqrt{n})^2}.
$$

As seen in Example 7.5, the quantity $1/[4(1 + \sqrt{n})^2]$ is the risk of $\delta^M$ over
$\mathcal{F}_0$. Since the risk function of $\delta^M$ takes its maximum over $\mathcal{F}_0$, we can
conclude that $\delta^M$ is minimax for the broader family $\mathcal{F}_1$. ★
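
The bound at the end of Example 7.6 depends on $F$ only through its mean and variance,
so it is simple to explore numerically. The Python sketch below evaluates the risk
expression $[\mathrm{Var}[x|\theta] + (1/2 - \theta)^2]/(1 + \sqrt{n})^2$ for a Bernoulli
distribution, where the bound is attained, and for two other distributions on $[0, 1]$.
The function names, the sample size, and the comparison distributions are our own
illustrative choices.

```python
from math import sqrt


def risk_delta_M(mean, variance, n):
    """Risk of delta^M for i.i.d. draws on [0, 1] with given mean and variance."""
    return (variance + (0.5 - mean) ** 2) / (1 + sqrt(n)) ** 2


def bernoulli_bound(n):
    """Constant risk of delta^M over the Bernoulli family, as in (7.23)."""
    return 1 / (4 * (1 + sqrt(n)) ** 2)


n = 25
theta = 0.3
print(risk_delta_M(theta, theta * (1 - theta), n), bernoulli_bound(n))  # Bernoulli
print(risk_delta_M(0.5, 1 / 12, n))                # uniform on [0, 1]
print(risk_delta_M(2 / 7, 10 / 392, n))            # Beta(2, 5)
```

Both non-Bernoulli cases fall strictly below the bound, in line with the argument that
the maximum risk over $\mathcal{F}_1$ is reached within the Bernoulli family.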

We close our discussion with a general sufficient condition under which minimax 
rules and least favorable distributions exist. 


Theorem 7.6 (The minimax theorem) Suppose that the loss function $L$ is bounded
below and $\Theta$ is finite. Then

$$
\sup_{\pi} \inf_{\delta} r(\pi, \delta) = \inf_{\delta} \sup_{\theta} R(\theta, \delta)
$$

and there exists a least favorable distribution $\pi^M$. If $R$ is closed from below, then
there is a minimax rule that is a Bayes rule with respect to $\pi^M$.


See Schervish (1995) for a proof. Weiss (1992) elaborates on the significance of this
result within Wald’s theory. The following example illustrates that the theorem does
not necessarily hold if $\Theta$ is not finite.


Example 7.7 (Ferguson 1967) Suppose that $\Theta = \mathcal{A} = \{1, 2, \ldots\}$, in a decision
problem where no data are available and the loss function is

$$
L(\theta, a) =
\begin{cases}
1 & \text{if } a < \theta, \\
0 & \text{if } a = \theta, \\
-1 & \text{if } a > \theta.
\end{cases}
$$

If we think of this as a game between the Statistician and Nature, both have to
pick an integer, and whoever chooses the larger integer wins the game. Priors can
formally be thought of as randomized strategies for Nature, so $\pi(i)$ is the probability
of Nature choosing integer $i$. We have
$r(\pi, a) = E_\theta[L(\theta,a)] = P(\theta > a) - P(\theta < a)$. Therefore,
$\sup_{\pi} \inf_{a} r(\pi, a) = -1$, which differs from $\inf_{a} \sup_{\theta} R(\theta, a) = 1$. Thus,
we do not have an optimal minimax strategy for this game. ★
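
To see the two values in Example 7.7 emerge, it helps to fix one randomized strategy
for Nature. The Python sketch below uses the geometric prior $\pi(i) = 2^{-i}$, which is
our own illustrative choice and not part of Ferguson's example. For this prior
$P(\theta > a) = 2^{-a}$ and $P(\theta < a) = 1 - 2^{-(a-1)}$, so the Bayes risk can be written
in exact arithmetic.

```python
from fractions import Fraction


def loss(theta, a):
    """Loss to the Statistician: 1 if Nature's integer is larger, -1 if smaller."""
    return (theta > a) - (theta < a)


def bayes_risk_geometric(a):
    """r(pi, a) when Nature plays pi(i) = 2**(-i), i = 1, 2, ..."""
    p_theta_greater = Fraction(1, 2**a)              # P(theta > a)
    p_theta_smaller = 1 - Fraction(1, 2 ** (a - 1))  # P(theta < a)
    return p_theta_greater - p_theta_smaller


# Against any fixed prior the Statistician can push the Bayes risk toward -1 ...
for a in (1, 5, 10, 20):
    print(a, float(bayes_risk_geometric(a)))

# ... while for any fixed action Nature can still answer with a + 1.
print(all(loss(a + 1, a) == 1 for a in range(1, 100)))
```

The same pattern holds for any proper prior on the integers, which is why the maximin
value is $-1$ while the minimax value is $1$.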


7.8 Exercises 

Problem 7.1 Lindley reports this interesting gastronomical conundrum with regard 
to minimax: 

You enter a restaurant and you study the menu: let us suppose for
simplicity it only contains two items, duck and goose. After reflection you
decide on goose. However, when the waiter appears he informs you
that chicken is also available. After further reflection and the use of the
minimax method you may decide on duck. It is hard to see how the
availability of a third possibility should make you change your selection from
the second to the first of the original alternatives, but such a change can
occur if the minimax method is used. The point to be emphasized here
is that two decision problems have been posed and solved and yet the
two results do not hang together, they do not cohere. (Lindley 1968b,
pp. 317-318)

Construct functions $u$, $L_u$, and $L$ that provide a numerical illustration of Lindley’s
example. This paradox is induced by the transformation to regret, so you only need
to go as far as illustrating the paradox for $L$.

Problem 7.2 You are collecting $n$ normal observations with mean $\theta$ and standard
deviation 1 because you are interested in testing $\theta = \theta_0 = -1$ versus
$\theta = \theta_1 = 1$. Using the notation of Section 7.5.1:

1. Find the set of values of $\pi$, $L_0$, and $L_1$ that give you the same decision rule as
the most powerful $\alpha$ level test, with $\alpha = 0.04$.

2. Can you find values of $\pi$, $L_0$, and $L_1$ that give you the same decision rule as
the most powerful $\alpha$ level test, with $\alpha = 0.04$, for both a sample of size $n$ and
a sample of size $2n$?

3. (Optional) If your answer to the second question is no, that means that using
the same $\alpha$ irrespective of sample size violates some of the axioms of expected
utility theory. Can you say which?

Problem 7.3 Testing problems are frequently cast in terms of comparisons between
a point null hypothesis, that is a hypothesis made up of a single point in the parameter
space, against a composite alternative, including multiple points. Formally,
$\Theta_0 = \theta_0$ and $\Theta_1 \neq \theta_0$. A standard Bayesian approach to this type of problem
is to specify a mixture prior for $\theta$ that assigns a point mass $\pi_0$ to the event
$\theta = \theta_0$ and a continuous distribution $\pi_1(\theta \mid \theta \in \Theta_1)$. This is sometimes
called “point and slab” prior and goes back to Jeffreys (1961). Given data
$x \sim f(x|\theta)$, find the Bayes rule in this setting. Read more about this problem in
Berger & Delampady (1987).

Problem 7.4 You are collecting normal data with mean $\theta$ and standard deviation
1 because you are interested in testing $\theta \le 0$ versus $\theta > 0$. Your loss is such that
mistakes are penalized differently within the alternative compared to the null, that is

$$
L(\theta,a_0) = L_0 |\theta|\, I_{(\theta > 0)},
$$

$$
L(\theta,a_1) = L_1 |\theta|\, I_{(\theta \le 0)}.
$$

1. Assume a uniform prior on $\theta$ and use Bayes’ theorem to derive the posterior
distribution.

2. Find the posterior expected losses of $a_0$ and $a_1$.

3. Find the Bayes decision rule for this problem.

Problem 7.5 Find the least favorable priors for the two sampling distributions in
Example 7.1. Does it bother you that the prior depends on the sampling model?

Problem 7.6 In order to choose an action $a$ based on the loss $L(\theta, a)$, you elicit the
opinion of two experts about the probabilities of the various outcomes of $\theta$ (you can
assume if you wish that $\theta$ is a discrete random variable). The two experts give you
distributions $\pi_1$ and $\pi_2$. Suppose that the action $a^*$ is Bayes for both distributions
$\pi_1$ and $\pi_2$. Is it true that $a^*$ must be Bayes for all weighted averages of $\pi_1$ and
$\pi_2$; that is, for all the distributions of the form $\alpha\pi_1 + (1 - \alpha)\pi_2$, with
$0 < \alpha < 1$?

Problem 7.7 Prove that the Bayes rule under absolute error loss is the posterior
median. More precisely, suppose that $E[|\theta|] < \infty$. Show that a number $a^*$ satisfies

$$
E[|\theta - a^*|] = \inf_{a} E[|\theta - a|]
$$

if and only if $a^*$ is a median of the distribution of $\theta$.

Problem 7.8 Let $\theta$ be a random variable with distribution $\pi(\theta)$ that is symmetric
with respect to $\theta_0$. Formally, $\pi(\theta + \theta_0) = \pi(\theta_0 - \theta)$ for all
$\theta \in \Re$. Suppose that $L$ is a nonnegative twice differentiable convex loss function
on the real line that is symmetric around the value of 0. Also suppose that, for all
values of $a$,

$$
\mathcal{L}(a) = \int_{-\infty}^{\infty} L(\theta - a)\, \pi(\theta)\, d\theta < \infty.
$$

Prove that:

1. $\mathcal{L}$ is convex;

2. $\mathcal{L}$ is symmetric with respect to $\theta_0$;

3. $\mathcal{L}$ is minimized at $a = \theta_0$.

Problem 7.9 Derive an analogous equation to (7.17), using the same assumptions
of Section 7.5.1, except now the sampling distribution is $f(x|\theta,\varphi)$, and the prior is
$\pi(\theta,\varphi)$.

Problem 7.10 Consider a point estimation problem in which you observe
$x_1, \ldots, x_n$ as i.i.d. random variables from the Poisson distribution

$$
f(x|\theta) = \frac{\theta^x e^{-\theta}}{x!}, \qquad x = 0, 1, 2, \ldots
$$

Assume a squared error of estimation loss $L(\theta,a) = (a - \theta)^2$, and assume a prior
distribution on $\theta$ given by a gamma density.

1. Show that the Bayes decision rule with respect to the prior above is of the
form

$$
\delta^*(x_1, \ldots, x_n) = a + b\bar{x}
$$

where $a > 0$, $b \in (0,1)$, and $\bar{x} = \sum_i x_i/n$. You may use the fact that the
distribution of $\sum_i x_i$ is Poisson with parameter $n\theta$ without proof.

2. Compute and graph the risk functions of $\delta^*(x_1, \ldots, x_n)$ and that of the MLE
$\delta(x_1, \ldots, x_n) = \bar{x}$.

3. Compute the Bayes risk of $\delta^*(x_1, \ldots, x_n)$ and show that it is (a) decreasing in
$n$ and (b) it goes to 0 as $n$ gets large.

4. Suppose an investigator wants to collect a sample that is large enough that the 
Bayes risk after the experiment is half of the Bayes risk before the experiment. 
Find that sample size. 

Problem 7.11 Take $x$ to be binomial with unknown $\theta$ and $n = 2$, and consider
testing $\theta_0 = 1/3$ versus $\theta_1 = 1/2$. Draw the points corresponding to every possible
decision rule in the risk space with coordinates $R(\theta_0, \delta)$ and $R(\theta_1, \delta)$. Identify
minimax randomized and nonrandomized rules and the expected utility rule for prior
$\pi(\theta_0) = 1/4$. Identify the set of rules that are not dominated by any other rule (those
are called admissible in Chapter 8).

Problem 7.12 

An engineer draws a random sample of electron tubes and measures the 
plate voltage under certain conditions with a very accurate voltmeter, 
accurate enough so that measurement error is negligible compared with 
the variability of the tubes. A statistician examines the measurements, 
which look normally distributed and vary from 75 to 99 volts with a mean 
of 87 and a standard deviation of 4. He makes the ordinary normal analysis, 
giving a confidence interval for the true mean. Later he visits the 
engineer’s laboratory, and notices that the voltmeter used reads only as 
far as 100, so the population appears to be “censored.” This necessitates 
a new analysis, if the statistician is orthodox. However, the engineer says 
he has another meter, equally accurate and reading to 1000 volts, which 
he would have used if any voltage had been over 100. This is a relief 
to the orthodox statistician, because it means the population was effectively 
uncensored after all. But the next day the engineer telephones and 
says: “I just discovered my high-range voltmeter was not working the 
day I did the experiment you analyzed for me.” The statistician ascertains 
that the engineer would not have held up the experiment until the 
meter was fixed, and informs him that a new analysis will be required. 

The engineer is astounded. He says: “But the experiment turned out just 
the same as if the high-range meter had been working. I obtained the 
precise voltages of my sample anyway, so I learned exactly what I would 
have learned if the high-range meter had been available. Next you’ll be 
asking me about my oscilloscope.” (From Pratt’s comments to Birnbaum 
1962, pp. 314-315) 

State a probabilistic model for the situation described by Pratt, and specify a prior 
distribution and a loss function for the point estimation of the mean voltage. Writing 
code if necessary, compute the risk function R of the Bayes rule and that of your 
favorite frequentist rule in two scenarios: when the high-range voltmeter is available 
and when it is not. Does examining the risk function help you select a decision rule 
once the data are observed? 


8 


Admissibility 


In this chapter we will explore further the concept of admissibility. Suppose, 
following Wald, we agree to look at long-run average loss $R$ as the criterion of interest 
for choosing decision rules. $R$ depends on $\theta$, but a basic requirement is that one 
should not prefer a rule that does worse than another no matter what the true $\theta$ is. This 
is a very weak requirement, and a key rationality principle for frequentist decision 
theory. There is a similarity between admissibility and the strict coherence condition 
presented in Section 2.1.2. In de Finetti’s terminology, a decision maker trading a 
decision rule for another that has higher risk everywhere could be described as a sure 
loser—except for the fact that the risk difference could be zero in some “lucky” cases. 

Admissible rules are those that cannot be dominated; that is, beaten at the risk 
game no matter what the truth is. Admissibility is a far more basic rationality 
principle than minimax in the sense that it does not require adhering to the 
“ultra-pessimistic” perspective on $\theta$. The group of people willing to take admissibility 
seriously is in fact much larger than those equating minimax to frequentist 
rationality. To many, therefore, characterization of sets of admissible decision rules has 
been a key component of the contribution of decision theory to statistics. 

It turns out that just by requiring admissibility one is drawn again towards an 
expected utility perspective. We will discover that to generate an admissible rule, all 
one needs is to find a Bayes rule. The essence of the story is that a decision rule 
cannot beat another everywhere and come behind on average. This needs to be made 
more rigorous, but, for example, it is enough to require that the Bayes rule be unique 
for it to work out. A related condition is to rule out “know-it-all” priors that are not 
ready to change in the light of data. This general connection between admissibility 
and Bayes means there is no way to rule out Bayes rules from the standpoint 
of admissibility, even though admissibility is based on long-run performance over 
repeated experiments. 

Even more interesting from the standpoint of foundations is that, in certain fairly 
big classes of decision problems, all admissible rules are Bayes rules or limiting 
cases of the Bayes rule. Not only can the chief frequentist rationality requirement 
not rule any Bayesian out, but also one has to be Bayesian (or close) to satisfy it 
in the first place! Another way to think about this is that no matter what admissible 
procedure one may concoct, somewhere behind the scenes, there is a prior (or a 
limit of priors) for which that is the expected utility solution. A Bayesian perspective 
brings that into the open and contributes to a more forthright scientific discourse. 

Featured articles: 

Neyman, J. & Pearson, E. S. (1933). On the problem of the most efficient tests of 
statistical hypotheses, Philosophical Transactions of the Royal Society (Series A) 
231: 286-337. 

This feature choice is a bit of a stretch in the sense that we do not dwell on the 
Neyman-Pearson theory at all. However, we discover at the end of the chapter that 
the famed Neyman-Pearson lemma can be reinterpreted as a complete class theorem 
in the light of all the theory we developed so far. The Neyman-Pearson theory was 
the spark for frequentist decision theory, and this is a good place to appreciate its 
impact on the broader theory. 

There is a vast literature on the topics covered in this chapter, which we only 
briefly survey. More extensive accounts and references can be found in Berger 
(1985), Robert (1994), and Schervish (1995). 

8.1 Admissibility and completeness 

If we take the risk function as the basis for comparison of two decision rules $\delta$ and 
$\delta'$, the following definition is natural: 

Definition 8.1 (R-better) A decision rule $\delta$ is called R-better than another decision 
rule $\delta'$ if 

$$R(\theta,\delta) \le R(\theta,\delta') \quad \forall\, \theta \in \Theta \tag{8.1}$$

and $R(\theta,\delta) < R(\theta,\delta')$ for some $\theta$. We also say that $\delta$ dominates $\delta'$. 

Figure 8.1 shows an example. If two rules have the same risk function they are called 
R-equivalent. 

If we are building a toolbox of rules for a specific decision problem, and $\delta$ is 
R-better than $\delta'$, we do not need to include $\delta'$ in our toolbox. Instead we should focus 
on rules that cannot be “beaten” at the R-better game. We make this formal with 
some additional definitions. 

Definition 8.2 (Admissibility) A decision rule $\delta$ is admissible if there is no R-better 
rule. 

Figure 8.1 Risk functions for decision rules $\delta$ and $\delta'$. Here $\delta$ dominates $\delta'$: the two 
risk functions are the same for $\theta$ close to 1 but the risk of $\delta$ is lower near 0 and never 
greater. 
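
As a rough numerical illustration of Definitions 8.1 and 8.2, the sketch below checks domination on a finite grid of $\theta$ values. The function names, the grid, and the three binomial estimators are assumptions made for this example and are not from the text. A grid check can only suggest domination, and being undominated within a short list of rules does not establish admissibility.

```python
from typing import Callable, List, Sequence

Risk = Callable[[float], float]  # maps theta to R(theta, delta)

def is_r_better(risk_a: Risk, risk_b: Risk, thetas: Sequence[float],
                tol: float = 1e-12) -> bool:
    """Grid version of Definition 8.1: rule a never has higher risk than
    rule b and has strictly lower risk at some grid point."""
    gaps = [risk_b(t) - risk_a(t) for t in thetas]
    return all(g >= -tol for g in gaps) and any(g > tol for g in gaps)

def undominated(risks: Sequence[Risk], thetas: Sequence[float]) -> List[int]:
    """Indices of the listed rules that no other listed rule dominates."""
    return [i for i, r in enumerate(risks)
            if not any(is_r_better(s, r, thetas)
                       for j, s in enumerate(risks) if j != i)]

# Hypothetical example: x ~ Bin(n, theta) with squared error loss.
n = 10
risk_mean = lambda t: t * (1 - t) / n             # delta(x) = x/n
risk_shifted = lambda t: t * (1 - t) / n + 0.01   # delta(x) = x/n + 0.1
risk_half = lambda t: (t - 0.5) ** 2              # delta(x) = 1/2, ignores the data

grid = [i / 100 for i in range(101)]
print(is_r_better(risk_mean, risk_shifted, grid))
print(undominated([risk_mean, risk_shifted, risk_half], grid))
```

In this toy list the shifted estimator is dominated by $x/n$, while the constant guess $1/2$ survives because its risk is zero at $\theta = 1/2$: not being dominated is a weak requirement and does not make a rule sensible.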

Referring back to Figure 7.2 from Chapter 7, $a_5$ is R-better than $a_6$, which is 
therefore not admissible. On the other hand, $a_4$ is admissible as long as only 
nonrandomized actions are allowed, even though it could never be chosen as the optimum 
under either principle. When randomized rules are allowed, it becomes possible to 
decrease losses from $a_4$ in both dimensions, and $a_4$ becomes inadmissible. It is true 
that Figure 7.2 talks about losses and not risk, but loss is a special case of risk when 
there are no data available. 

In general, admissibility will eliminate rules but will not determine a unique winner. 
For example, in Figure 7.2, there are five admissible rules among nonrandomized 
ones, and infinitely many admissible randomized rules. Nonetheless admissibility 
has two major strengths that place it in a strategic position in decision theory: it can 
be used to rule out decision rules, or entire approaches, as inconsistent with rational 
behavior; and it can be used to characterize broad families of rules that are sure to 
include all possible admissible rules. This is implemented via the notion of completeness, 
a property of classes of decision rules that ensures that we are not leaving 
out anything that could turn out to be useful. 

Definition 8.3 (Completeness) We give three variants: 

1. A class $\mathcal{D}$ of decision rules is complete if for any decision rule $\delta \notin \mathcal{D}$, there 
is a decision rule $\delta' \in \mathcal{D}$ which is R-better than $\delta$. 

2. A class $\mathcal{D}$ of decision rules is essentially complete if for any decision rule 
$\delta \notin \mathcal{D}$, there is a decision rule $\delta' \in \mathcal{D}$ such that $R(\theta,\delta') \le R(\theta,\delta)$ for all $\theta$; 
that is, $\delta'$ is R-better or R-equivalent to $\delta$. 

3. A class $\mathcal{D}$ of decision rules is said to be minimal (essentially) complete if $\mathcal{D}$ 
is (essentially) complete and if no proper subset of $\mathcal{D}$ is (essentially) complete. 

From this definition it follows that any decision rule $\delta$ that is outside of a complete 
class is inadmissible. If the class is essentially complete, then a decision rule $\delta$ 
that is outside of the class may be admissible, but one can find a decision rule $\delta'$ in the 
class with the same risk function. Thus, it seems reasonable to restrict our attention 
to a complete or essentially complete class. 

Complete classes contain all admissible decision rules, but they may contain 
inadmissible rules as well. A stronger connection exists with minimal complete 
classes: these classes are such that, if even one rule is taken out, they are no longer 
complete. 

Theorem 8.1 If a minimal complete class exists, then it coincides with the class of 
all admissible decision rules. 

We left the proof for you (Problem 8.3). 

Complete class results are practical in the sense that if one is interested in admissible 
strategies, one can safely restrict attention to complete or essentially complete 
classes and study those further. The implications for foundations are also far reaching. 
Complete class results are useful for characterizing statistical approaches at a 
higher level of abstraction than single decision rules. For example, studying the class 
of all tests based on a threshold of the likelihood ratio, one can investigate the whole 
Neyman-Pearson theory from the point of view of rationality. For another example, 
we can explore whether it is necessary to use randomized decision rules: 

Theorem 8.2 If the loss function $L(\theta,a)$ is convex in $a$, then the class of 
nonrandomized decision rules $\mathcal{D}$ is complete. 

We left this proof for you as well (Problem 8.8). 
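
The idea behind Theorem 8.2 is Jensen's inequality: under a convex loss, replacing a randomized action by its average cannot increase the expected loss. The sketch below checks only this pointwise step for a single randomized action with no data; it is not a proof of the theorem, and the squared error loss and the two-point randomization are assumptions chosen for illustration.

```python
def expected_loss(theta, actions, probs, loss):
    """Expected loss of a randomized action picking actions[i] w.p. probs[i]."""
    return sum(p * loss(theta, a) for a, p in zip(actions, probs))

def mean_action(actions, probs):
    """Nonrandomized action obtained by averaging the randomized one."""
    return sum(p * a for a, p in zip(actions, probs))

squared = lambda theta, a: (theta - a) ** 2   # convex in a

# Hypothetical randomized action: 0.2 or 0.8 with equal probability.
actions, probs = [0.2, 0.8], [0.5, 0.5]
a_bar = mean_action(actions, probs)

for theta in (0.0, 0.3, 0.5, 1.0):
    print(theta, squared(theta, a_bar),
          expected_loss(theta, actions, probs, squared))
```

For squared error the gap between the two columns equals the variance of the randomization, so it is the same for every $\theta$.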


8.2 Admissibility and minimax 

Unfortunately for minimax aficionados, this section is very short. In general, to prove 
admissibility of minimax rules it is best to hope they are close enough to Bayes or 
limiting Bayes and use the theory of the next section. This will not always be the 
case: a famously inadmissible minimax rule is the subject of Chapter 9. 

On the flip side, a very nice result is available for proving minimaxity once you 
already know a rule is admissible. 

Theorem 8.3 If $\delta^M$ has constant risk and it is admissible, then $\delta^M$ is the unique 
minimax rule. 


Problem 8.11 asks you to prove this and is a useful one to quickly work out before 
reading the rest of the chapter. 

For an example, say $x$ is a $\mathrm{Bin}(n, \theta)$ and 

$$L(\theta, a) = \frac{(\theta - a)^2}{\theta(1 - \theta)}.$$

Consider the rule $\delta^*(x) = x/n$. The risk function is 

$$R(\theta,\delta^*) = E_{x \mid \theta}\left[\frac{(\theta - x/n)^2}{\theta(1-\theta)}\right] = \frac{1}{n},$$

that is $\delta^*$ has constant risk. With a little work you can also show that $\delta^*$ is the unique 
Bayes rule with respect to a uniform prior on (0,1). Lastly, using the machinery of the 
next section we will be able to prove that $\delta^*$ is admissible. Then, using Theorem 8.3, 
we can conclude that it is minimax. 

8.3 Admissibility and Bayes 

8.3.1 Proper Bayes rules 

Let us now consider more precisely the conditions for the Bayes rules to be admissible. 
We start with the easiest case, when $\Theta$ is discrete. In this case, a sufficient 
condition is that the prior should not completely rule out any of the possible 
values of $\theta$. 

Theorem 8.4 If $\Theta$ is discrete, and the prior $\pi$ gives positive probability to each 
element of $\Theta$, then the Bayes rule with respect to $\pi$ is admissible. 

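Proof: Let $\delta^*$ be the Bayes rule with respect to $\pi$, and suppose, for a contradiction, 
that $\delta^*$ is not admissible; that is, there is another rule, say $\delta$, that is R-better than $\delta^*$. 
Then $R(\theta,\delta) \le R(\theta,\delta^*)$ for all $\theta$, with strict inequality $R(\theta_0,\delta) < R(\theta_0,\delta^*)$ for some $\theta_0$ in $\Theta$ 
with positive mass $\pi(\theta_0) > 0$. Then, 

$$r(\pi,\delta) = \sum_{\theta \in \Theta} R(\theta,\delta)\pi(\theta) < \sum_{\theta \in \Theta} R(\theta,\delta^*)\pi(\theta) = r(\pi,\delta^*).$$

This contradicts the fact that $\delta^*$ is Bayes. Therefore, $\delta^*$ must be admissible. □ 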
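
Theorem 8.4 can be illustrated by brute force in a small problem. The sketch below assumes a hypothetical setup that is not from the text: $x \sim \mathrm{Bin}(3,\theta)$, $\Theta = \{1/4, 3/4\}$, 0-1 loss, and a prior putting mass $1/2$ on each value. It enumerates the nonrandomized rules only, so it checks the conclusion of the theorem within that class rather than over all randomized rules. Exact fractions avoid rounding issues in the comparisons.

```python
from fractions import Fraction
from itertools import product
from math import comb

THETAS = (Fraction(1, 4), Fraction(3, 4))   # hypothetical two-point parameter space
PRIOR = (Fraction(1, 2), Fraction(1, 2))    # positive mass on every theta
N = 3                                       # x ~ Bin(3, theta)

def pmf(x, theta):
    return comb(N, x) * theta**x * (1 - theta)**(N - x)

def risk(rule, i):
    """0-1 loss risk at THETAS[i]; rule[x] is the index of the theta chosen at x."""
    return sum(pmf(x, THETAS[i]) for x in range(N + 1) if rule[x] != i)

def bayes_risk(rule):
    return sum(p * risk(rule, i) for i, p in enumerate(PRIOR))

def dominates(a, b):
    """True if rule a is R-better than rule b (Definition 8.1)."""
    ra = [risk(a, i) for i in (0, 1)]
    rb = [risk(b, i) for i in (0, 1)]
    return all(x <= y for x, y in zip(ra, rb)) and any(x < y for x, y in zip(ra, rb))

RULES = list(product((0, 1), repeat=N + 1))  # all 16 nonrandomized rules
bayes_rule = min(RULES, key=bayes_risk)

print(bayes_rule, [risk(bayes_rule, i) for i in (0, 1)])
print(any(dominates(r, bayes_rule) for r in RULES))
```

Changing `PRIOR` to put zero mass on one value of $\theta$ is a quick way to explore why the positivity condition matters.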

Incidentally, from a subjectivist standpoint, it is a little bit tricky to accept the 
fact that there may even be points in a discrete $\Theta$ that do not have a positive mass: 
why should they be in $\Theta$ in the first place if they do not? This question opens a 
well-populated can of worms about whether we can define $\Theta$ independently of the 
prior, but we will close it quickly and pretend, for example, that we have a reasonably 
objective way to define $\Theta$ or that we can take $\Theta$ to be the union of the supports over 
a group of reasonable decision makers. 

An analogous result to Theorem 8.4 holds for continuous $\Theta$, but you have to start 
being careful with your real analysis. An easy case is when $\pi$ gives positive mass to 
all open sets and the risk function is continuous. See Problem 8.10. The plot of the 
proof is very similar to that used in the discrete parameter case, with the difference 
that, to reach a contradiction, we need to use the continuity condition to create a 
lump of dominating cases with a strictly positive prior probability. 

Another popular condition for guaranteeing admissibility of Bayes rules is 
uniqueness. 

Theorem 8.5 Any unique Bayes estimator is admissible. 

The proof of this theorem is left for you as Problem 8.4. To see the need for 
uniqueness, look at Problem 8.9. 

Uniqueness is by no means a necessary condition: the important point is that 
when there is more than one Bayes rule, all Bayes rules share the same risk function. 
This happens for example if two Bayes estimators differ only on a set $S$ such that 
$P_\theta(S) = 0$, for all $\theta$. 

Theorem 8.6 Suppose that every Bayes rule with respect to a prior distribution, 
$\pi$, has the same risk function. Then all these rules are admissible. 

Proof: Let $\delta^*$ denote a Bayes rule with respect to $\pi$ and $R(\theta, \delta^*)$ denote the risk of 
any such Bayes rule. Suppose that there exists $\delta_0$ such that $R(\theta,\delta_0) \le R(\theta,\delta^*)$ for all 
$\theta$ and with strict inequality for some $\theta$. This implies that 

$$\int_\Theta R(\theta,\delta_0)\pi(\theta)\,d\theta \le \int_\Theta R(\theta,\delta^*)\pi(\theta)\,d\theta.$$

Because $\delta^*$ is a Bayes rule with respect to $\pi$, the inequality must be an equality and 
$\delta_0$ is also a Bayes rule with risk $R(\theta,\delta^*)$, which leads to a contradiction. □ 

8.3.2 Generalized Bayes rules 

Priors do not sit well with frequentists, but being able to almost infallibly generate 
admissible rules does. A compromise that sometimes works is to be lenient on 
the requirement that the prior should integrate to one. We have seen an example in 
Section 7.7. A prior that integrates to something other than one, including perhaps 
infinity, is called an improper prior. Sometimes you can plug an improper prior into 
Bayes’ rule and get a proper posterior-like distribution, which you can use to figure 
out the action that minimizes the posterior expected loss. What you get out of this 
procedure is called a generalized Bayes rule: 

Definition 8.4 If $\pi$ is an improper prior, and $\delta^*$ is an action which minimizes the 
posterior expected loss $E_{\theta \mid x}[L(\theta, \delta(x))]$, for each $x$ with positive predictive density 
$m(x)$, then $\delta^*$ is a generalized Bayes rule. 

For example, if you are estimating a normal mean $\theta$, with known variance, and 
you assume a prior $\pi(\theta) = 1$, you can work out the generalized Bayes rule to be 
$\delta^*(x) = x$. This is also the maximum likelihood estimator, and the minimum variance 
unbiased estimator. The idea of choosing priors that are vague enough that the 
Bayes solution agrees with common statistical practice is a convenient and practical 
compromise in some cases, but a really tricky business in others (Bernardo and Smith 
1994, Robert 1994). Getting into details would be too much of a digression for us 
now. What is more relevant for our discussion is that a lot of generalized Bayes rules 
are admissible. 
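
The flat-prior example can be checked numerically. The sketch below assumes squared error loss (so that the generalized Bayes rule is the posterior mean), a single observation $x \sim N(\theta, \sigma^2)$, and a truncation of the real line to $[-20, 20]$; these choices and the function name are assumptions for illustration.

```python
from math import exp

def generalized_bayes_mean(x: float, sigma: float = 1.0,
                           lo: float = -20.0, hi: float = 20.0,
                           step: float = 0.01) -> float:
    """Posterior mean of theta under the improper prior pi(theta) = 1,
    approximated by a Riemann sum over [lo, hi]."""
    m = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(m + 1)]
    # Prior (= 1) times likelihood; constants cancel in the ratio below.
    weights = [exp(-0.5 * ((x - t) / sigma) ** 2) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

for x in (-2.0, 0.0, 1.3):
    print(x, generalized_bayes_mean(x))   # each value should be close to x
```

Although the prior does not integrate, prior times likelihood does, which is what makes the posterior mean well defined here.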
Theorem 8.7 If $\Theta$ is a subset of $\mathbb{R}^k$ such that every neighborhood of every 
point in $\Theta$ intersects the interior of $\Theta$, $R(\theta,\delta)$ is continuous in $\theta$ for all $\delta$, the 
Lebesgue measure on $\Theta$ is absolutely continuous with respect to $\pi$, $\delta^*$ is a generalized 
Bayes rule with respect to $\pi$, and $L(\theta, \delta^*(x))f(x \mid \theta)$ is $\nu \times \pi$ integrable, then $\delta^*$ is 
admissible. 

See Schervish (1995) for the proof. A key piece is the integrability of $L$ times $f$, which 
gives a finite Bayes risk r. The other major condition is continuity. Nonetheless, this 
is a very general result. 

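Example 8.1 (Schervish 1995, p. 158) Suppose $x$ is a $\mathrm{Bin}(n, \theta)$ and $\theta \in \Theta = \mathcal{A} = 
[0, 1]$. The loss is $L(\theta,a) = (\theta - a)^2$ and the prior is $\pi(\theta) = 1/[\theta(1 - \theta)]$, which is 
improper. We will show that $\delta(x) = x/n$ is admissible using Theorem 8.7. There is 
an easier way which you can learn about in Problem 8.6. 

The posterior distribution of $\theta$, given $x$, is $\mathrm{Beta}(x, n - x)$. Therefore, for $x = 
1,2,\ldots,n - 1$, the (generalized) Bayes rule is $\delta^*(x) = x/n$, that is the posterior 
mean. Now, if $x = 0$, the posterior expected loss is proportional to 

$$\int_0^1 (\theta - \delta^*(x))^2\, \theta^{-1}(1-\theta)^{n-1}\, d\theta,$$

which is finite only if $\delta^*(x) = 0$. Similarly, if $x = n$, it is proportional to 

$$\int_0^1 (\theta - \delta^*(x))^2\, \theta^{n-1}(1-\theta)^{-1}\, d\theta,$$

which is finite only if $\delta^*(x) = 1$. Therefore, $\delta^*(x) = x/n$ is the generalized Bayes rule 
with respect to $\pi$. Furthermore, 

$$R(\theta,\delta^*) = \frac{\theta(1-\theta)}{n}$$

has prior expectation $1/n$. Therefore, $\delta^*$ is admissible. If you are interested 
in a general characterization of admissible estimators for this problem, read 
Johnson (1971). 

★ 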


The following theorem from Blyth (1951) is one of the earliest characterizations 
of admissibility. It provides a sufficient admissibility condition by relating admissibility 
of an estimator to the existence of a sequence of priors using which we 
obtain Bayes rules whose risk approximates the risk of the estimator in question. 
This requires continuity conditions that are a bit strong: in particular the decision 
rules with continuous risk functions must form a complete class. We will return to 
this topic in Section 8.4.1. 


Theorem 8.8 Consider a decision problem in which $\Theta$ has positive Lebesgue measure, 
and in which the decision rules with continuous risk functions form a complete 
class. Then an estimator $\delta_0$ (with a continuous risk function) is admissible if there 
exists a sequence $\{\pi_n\}_{n=1}^{\infty}$ of (generalized) priors such that: 

1. $r(\pi_n, \delta_0)$ and $r(\pi_n, \delta_n^*)$ are finite for all $n$, where $\delta_n^*$ is the Bayes rule with respect 
to $\pi_n$; 

2. for any nondegenerate convex set $C \subset \Theta$, there exists a positive constant $K$ 
and an integer $N$ such that, for $n \ge N$, 

$$\int_C \pi_n(\theta)\,d\theta \ge K;$$

3. $\lim_{n \to \infty} [r(\pi_n, \delta_0) - r(\pi_n, \delta_n^*)] = 0$. 

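Proof: Suppose for a contradiction that $\delta_0$ is not admissible. Then, there must exist 
a decision rule $\delta'$ that dominates $\delta_0$, and therefore a $\theta_0$ such that $R(\theta_0,\delta') < R(\theta_0,\delta_0)$. 
Because the rules with continuous risk functions form a complete class, both $\delta_0$ and 
$\delta'$ have a continuous risk function. Thus, there exist $\epsilon_1 > 0$, $\epsilon_2 > 0$ such that 

$$R(\theta,\delta_0) - R(\theta,\delta') > \epsilon_1,$$

for $\theta$ in an open set $C = \{\theta \in \Theta : |\theta - \theta_0| < \epsilon_2\}$. 

From condition 1, $r(\pi_n, \delta_n^*) \le r(\pi_n, \delta')$, because $\delta_n^*$ is a Bayes rule with respect to 
$\pi_n$. Thus, for all $n \ge N$, 

$$\begin{aligned}
r(\pi_n, \delta_0) - r(\pi_n, \delta_n^*) &\ge r(\pi_n, \delta_0) - r(\pi_n, \delta') \\
&= E_\theta[R(\theta,\delta_0) - R(\theta,\delta')] \\
&\ge \int_C [R(\theta,\delta_0) - R(\theta,\delta')]\,\pi_n(\theta)\,d\theta \\
&\ge \epsilon_1 \int_C \pi_n(\theta)\,d\theta \\
&\ge \epsilon_1 K,
\end{aligned}$$

from condition 2. That $r(\pi_n, \delta_0) - r(\pi_n, \delta_n^*) \ge \epsilon_1 K$ is in contradiction with condition 3. 
Thus we conclude that $\delta_0$ must be admissible. □ 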

Example 8.2 To illustrate the application of this theorem, we will consider a simplified 
version of Blyth’s own example on the admissibility of the sample average for 
estimating a normal mean parameter. See also Berger (1985, p. 548). Suppose that 
$x \sim N(\theta, 1)$ and that it is desired to estimate $\theta$ under loss $L(\theta, a) = (\theta - a)^2$. We will 
show that $\delta_0(x) = x$ is admissible. A convenient choice for $\pi_n$ is the unnormalized 
normal density 

$$\pi_n(\theta) = (2\pi)^{-1/2} \exp(-0.5\,\theta^2/n).$$

The posterior distribution of $\theta$, given $x$, is $N(xn/(n + 1), n/(n + 1))$. Therefore, 
with respect to $\pi_n$, the generalized Bayes decision rule is $\delta_n^*(x) = xn/(n + 1)$, that is 
the posterior mean. 




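For condition 1, observe that 

$$R(\theta,\delta_0) = \int_{\mathcal{X}} L(\theta,\delta_0(x)) f(x \mid \theta)\,dx = E[(\theta - x)^2 \mid \theta] = \mathrm{Var}[x \mid \theta] = 1.$$

Therefore, 

$$r(\pi_n,\delta_0) = \int_\Theta R(\theta,\delta_0)\pi_n(\theta)\,d\theta = \int_\Theta \pi_n(\theta)\,d\theta = \sqrt{n}.$$

Here $\sqrt{n}$ is the normalizing constant of the prior. Similarly, 

$$R(\theta,\delta_n^*) = E\left[\left(\theta - \frac{xn}{n+1}\right)^2 \,\Big|\, \theta\right] = \frac{\theta^2}{(n+1)^2} + \frac{n^2}{(n+1)^2}.$$

Thus, we obtain that 

$$r(\pi_n,\delta_n^*) = \int_\Theta R(\theta,\delta_n^*)\pi_n(\theta)\,d\theta = \frac{\sqrt{n}\,n}{n+1}.$$

For condition 2, 

$$\int_C \pi_n(\theta)\,d\theta \ge \int_C \pi_1(\theta)\,d\theta = K > 0$$

as long as $C$ is a nondegenerate convex subset of $\Theta$. Note that when $C = \mathbb{R}$, $K = 1$. 
The sequence of proper priors $N(0, 1/n)$ would not satisfy this condition. 

Finally, for condition 3 we have 

$$\lim_{n \to \infty} [r(\pi_n, \delta_0) - r(\pi_n,\delta_n^*)] = \lim_{n \to \infty} \sqrt{n}\left(1 - \frac{n}{n+1}\right) = \lim_{n \to \infty} \frac{\sqrt{n}}{n+1} = 0.$$

Therefore, $\delta_0(x) = x$ is admissible. ★ 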
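
The Bayes risk gap in condition 3 of Theorem 8.8 can be checked numerically for Example 8.2. The sketch below integrates the two risk functions against the unnormalized prior $\pi_n$ with a plain Riemann sum and compares the difference with the closed form $\sqrt{n}/(n+1)$. The truncation at $\pm 12\sqrt{n}$, the number of grid points, and the function names are assumptions for illustration.

```python
from math import exp, pi, sqrt

def prior(theta: float, n: int) -> float:
    """Unnormalized N(0, n) density of Example 8.2; total mass sqrt(n)."""
    return exp(-0.5 * theta**2 / n) / sqrt(2 * pi)

def integrate(f, n: int, points: int = 4801) -> float:
    """Riemann sum over [-12 sqrt(n), 12 sqrt(n)], an assumed truncation."""
    half = 12 * sqrt(n)
    h = 2 * half / (points - 1)
    return h * sum(f(-half + i * h) for i in range(points))

def bayes_risk_gap(n: int) -> float:
    """r(pi_n, delta_0) - r(pi_n, delta_n*) by numerical integration."""
    r0 = integrate(lambda t: 1.0 * prior(t, n), n)   # R(theta, delta_0) = 1
    rn = integrate(lambda t: (t**2 + n**2) / (n + 1) ** 2 * prior(t, n), n)
    return r0 - rn

for n in (1, 10, 100, 1000):
    print(n, bayes_risk_gap(n), sqrt(n) / (n + 1))
```

The gap shrinks even though both Bayes risks grow like $\sqrt{n}$, which is why the unnormalized priors are convenient here.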

It can be shown that $\delta_0(x) = x$ is admissible when $x$ is bivariate, but this requires 
a more complex sequence of priors. We will see in the next chapter that the natural 
generalization of this result to three or more dimensions does not work. In other 
words, the sample average of samples from a multivariate normal distribution is not 
admissible. 

A necessary and sufficient admissibility condition based on limits of improper 
priors was established later by Stein (1955). 

Theorem 8.9 Assume that $f(x \mid \theta)$ is continuous in $\theta$ and strictly positive over $\Theta$. 
Moreover, assume that the loss function $L(\theta,a)$ is strictly convex, continuous, and 
such that for a compact subset $E \subset \Theta$, 

$$\lim_{\|\delta\| \to +\infty} \inf_{\theta \in E} L(\theta, \delta) = +\infty.$$

An estimator $\delta_0$ is admissible if and only if there exists a sequence $\{F_n\}$ of increasing 
compact sets such that $\Theta = \bigcup_n F_n$, a sequence $\{\pi_n\}$ of finite measures with support 
$F_n$, and a sequence $\{\delta_n^*\}$ of Bayes estimators associated with $\pi_n$ such that: 

1. there exists a compact set $E_0 \subset \Theta$ such that $\inf_n \pi_n(E_0) \ge 1$; 

2. if $E \subset \Theta$ is compact, $\sup_n \pi_n(E) < +\infty$; 

3. $\lim_n r(\pi_n, \delta_0) - r(\pi_n, \delta_n^*) = 0$; 

4. $\lim_n R(\theta,\delta_n^*) = R(\theta,\delta_0)$. 

Stein’s result is stated in terms of continuous losses, while earlier ones we 
looked at were based on continuous risk. The next lemma looks at conditions for 
the continuity of loss functions to imply the continuity of the risk. 

Lemma 3 Suppose that $\Theta$ is a subset of $\mathbb{R}^m$ and that the loss function $L(\theta,a)$ is 
bounded and continuous as a function of $\theta$, for all $a \in \mathcal{A}$. If $f(x \mid \theta)$ is continuous in 
$\theta$, for every $x$, the risk function of every decision rule $\delta$ is continuous. 

The proof of this lemma is in Robert (1994, p. 239). 


8.4 Complete classes 

8.4.1 Completeness and Bayes 

The results in this section relate the concepts of admissibility and completeness to 
Bayes decision rules and investigate conditions for Bayes rules to span the set of all 
admissible decision rules and thus form a minimal complete class. 

Theorem 8.10 Suppose that $\Theta$ is finite, the loss function is bounded below, and 
the risk set is closed from below. Then the set of all Bayes rules is a complete class, 
and the set of admissible Bayes rules is a minimal complete class. 

For a proof see Schervish (1995, pp. 179-180). The conditions of this result mimic 
the typical game-theoretic setting (Berger 1985, chapter 5) and are well illustrated 
by the risk set discussion of Section 7.5.1. The Bayes rules are also rules whose risk 
functions are on the lower boundary of the risk set. 

Generalizations of this result to parameter sets that are general subsets of $\mathbb{R}^m$ 
require more work. A key role is played again by continuity of the risk function. 

Theorem 8.11 Suppose that $\mathcal{A}$ and $\Theta$ are closed and bounded subsets of the 
Euclidean space. Assume that the loss function $L(\theta,a)$ is a continuous function of 
$a$ for each $\theta \in \Theta$, and that all decision rules have continuous risk functions. Then 
the Bayes rules form a complete class. 

The proof is in Berger (1985, p. 546). 

Continuity of the risk may seem like a restrictive assumption, but it is commonly 
met and in fact decision rules with continuous risk functions often form a complete 
class. An example is given in the next theorem. An important class of statistical 
problems is those with the monotone likelihood ratio property, which requires that 
for every $\theta_1 < \theta_2$, the likelihood ratio 

$$\frac{f(x \mid \theta_2)}{f(x \mid \theta_1)}$$

is a nondecreasing function of $x$ on the set for which at least one of the densities is 
nonzero. For problems with monotone likelihood ratio, and continuous loss, we only 
need to worry about decision rules with finite and continuous risk: 

Theorem 8.12 Consider a statistical decision problem where $\mathcal{X}$, $\Theta$, and $\mathcal{A}$ are 
subsets of $\mathbb{R}$ with $\mathcal{A}$ being closed. Assume that $f(x \mid \theta)$ is continuous in $\theta$ and it has 
the monotone likelihood ratio property. If: 

1. $L(\theta, a)$ is a continuous function of $\theta$ for every $a \in \mathcal{A}$; 

2. $L$ is nonincreasing in $a$ for $a \le \theta$ and nondecreasing for $a \ge \theta$; and 

3. there exist two functions $K_1$ and $K_2$ bounded on the compact subsets of $\Theta \times \Theta$ 
such that 

$$L(\theta_1, a) \le K_1(\theta_1,\theta_2) L(\theta_2, a) + K_2(\theta_1,\theta_2), \quad \forall a \in \mathcal{A}$$

then decision rules with finite and continuous risk functions form a complete 
class. 

For proofs of this theorem see Ferguson (1967) and Brown (1976). 
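
The monotone likelihood ratio property used in Theorem 8.12 is easy to probe numerically for a given family. The sketch below checks monotonicity of $f(x \mid \theta_2)/f(x \mid \theta_1)$ on user-supplied grids, so a positive answer is only suggestive while a negative one is a genuine counterexample. The function name and the two example families (binomial and Cauchy location) are assumptions chosen for illustration.

```python
from math import comb, pi

def has_mlr_on_grid(density, xs, thetas) -> bool:
    """Check that f(x|t2)/f(x|t1) is nondecreasing in x for every t1 < t2.
    `xs` must be sorted in increasing order; only grid points are checked."""
    ts = sorted(thetas)
    for i, t1 in enumerate(ts):
        for t2 in ts[i + 1:]:
            ratios = []
            for x in xs:
                f1, f2 = density(x, t1), density(x, t2)
                if f1 == 0 and f2 == 0:
                    continue  # x is outside both supports
                ratios.append(float("inf") if f1 == 0 else f2 / f1)
            if any(b < a for a, b in zip(ratios, ratios[1:])):
                return False
    return True

binom = lambda x, t, n=5: comb(n, x) * t**x * (1 - t) ** (n - x)
cauchy = lambda x, t: 1.0 / (pi * (1.0 + (x - t) ** 2))  # location family

print(has_mlr_on_grid(binom, range(6), [0.2, 0.5, 0.8]))
print(has_mlr_on_grid(cauchy, [-10, -5, 0, 5, 10], [0.0, 1.0]))
```

The Cauchy location family fails because its likelihood ratio tends to 1 in both tails, so it cannot be monotone in $x$.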


8.4.2 Sufficiency and the Rao-Blackwell inequality 

Sufficiency is one of the most important concepts in statistics. A sufficient statistic 
for $\theta$ is a function of the data that summarizes them in such a way that no information 
concerning $\theta$ is lost. Should decision rules only depend on the data via 
sufficient statistics? We have hinted at the fact that Bayes rules automatically do 
that, in Section 7.2.3. We examine this issue in more detail in this section. This is a 
good time to lay out a definition of sufficient statistic. 

Definition 8.5 (Sufficient statistic) Let $x \sim f(x \mid \theta)$. A function $t$ of $x$ is said to 
be a sufficient statistic for $\theta$ if the conditional distribution of $x$, given $t(x) = t$, is 
independent of $\theta$. 

The Neyman factorization theorem shows that $t$ is sufficient whenever the 
likelihood function can be decomposed as 

$$f(x \mid \theta) = h(x)\, g(t(x) \mid \theta). \tag{8.2}$$

See, for example, Schervish (1995), for a discussion of sufficiency and this 
factorization. 

Bahadur (1955) proved a very general result relating sufficiency and rationality. 
Specifically, he showed under very general conditions that the class of decision functions 
which depend on the observed sample only through a function $t$ is essentially 
complete if and only if t is a sufficient statistic. Restating it more concisely: 

Theorem 8.13 The class of all decision rules based on a sufficient statistic is 
essentially complete. 

The proof in Bahadur (1955) is hard measure-theoretic work, as it makes use of the 
very abstract definition of sufficiency proposed by Halmos and Savage (1949). 

An older, less general result provides a constructive way of showing how to find a 
rule that is R-better than or R-equivalent to a rule that depends on the data via statistics 
that are not sufficient. 

Theorem 8.14 (Rao-Blackwell theorem) Suppose that $\mathcal{A}$ is a convex subset of 
$\mathbb{R}^m$ and that $L(\theta,a)$ is a convex function of $a$ for all $\theta \in \Theta$. Suppose also that $t$ is 
a sufficient statistic for $\theta$, and that $\delta_0$ is a nonrandomized decision rule such that 
$E_{x \mid \theta}[\,|\delta_0(x)|\,] < \infty$. The decision rule defined as 

$$\delta_1(t) = E_{x \mid t}[\delta_0(x)] \tag{8.3}$$

is R-equivalent to or R-better than $\delta_0$. 

Proof: Jensen’s inequality states that if a function $g$ is convex, then $g(E[x]) \le 
E[g(x)]$. Therefore, 

$$L(\theta,\delta_1(t)) = L(\theta, E_{x \mid t}[\delta_0(x)]) \le E_{x \mid t}[L(\theta,\delta_0(x))] \tag{8.4}$$

and 

$$\begin{aligned}
R(\theta,\delta_1) &= E_{t \mid \theta}[L(\theta,\delta_1(t))] \\
&\le E_{t \mid \theta}[E_{x \mid t}[L(\theta,\delta_0(x))]] \\
&= E_{x \mid \theta}[L(\theta,\delta_0(x))] \\
&= R(\theta,\delta_0)
\end{aligned}$$

completing the proof. □ 

This result does not say that the decision rule $\delta_1$ is any good, but only that it is no 
worse than the $\delta_0$ we started with. 

Example 8.3 (Schervish 1995, p. 153). Suppose that $x_1,\ldots,x_n$ are independent 
and identically distributed as $N(\theta, 1)$ and that we wish to estimate the tail area to the 
left of $c - \theta$ with squared error loss 

$$L(\theta,a) = (a - \Phi(c - \theta))^2.$$

Here $c$ is some fixed real number, and $a \in \mathcal{A} = [0,1]$. A possible decision rule is the 
empirical tail frequency 

$$\delta_0(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,c]}(x_i).$$

It can be shown that $t = \bar{x}$ is a sufficient statistic for $\theta$, for example using 
equation (8.2). Since $\mathcal{A}$ is convex and the loss function is a convex function of $a$, 

$$E_{x \mid t}[\delta_0(x)] = \frac{1}{n}\sum_{i=1}^{n} P(x_i \le c \mid t) = \Phi\left(\frac{c - t}{\sqrt{(n-1)/n}}\right),$$

because $x_i \mid t$ is $N(t, [n - 1]/n)$. 

Because of the Rao-Blackwell theorem, the rule $\delta_1(t) = E_{x \mid t}[\delta_0(x)]$ is R-better 
than $\delta_0$. In this case, the empirical frequency does not make any use of the functional 
form of the likelihood, which is known. Using this functional form we can bring 
all the data to bear in estimating the tail probability and come up with an estimate 
with lower risk everywhere. Clearly this is predicated on knowing for sure that the 
data are normally distributed. Using the entire set of data to estimate the left tail 
would not be as effective if we did not know the parametric form of the sampling 
distribution. ★ 
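
A small simulation can make the risk comparison in Example 8.3 concrete. The sketch below estimates, by Monte Carlo, the squared error risk of the empirical tail frequency and of its Rao-Blackwellized version at one value of $\theta$. The function name, the seed, and the settings $\theta = 0$, $c = 0.5$, $n = 10$ are assumptions for illustration; the output is a simulation estimate, not an exact risk.

```python
import random
from statistics import NormalDist, fmean

def simulate_risks(theta: float, c: float, n: int, reps: int, seed: int = 0):
    """Monte Carlo risks of delta_0 (empirical frequency of x_i <= c) and
    delta_1 (its conditional expectation given the sample mean)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    target = std_normal.cdf(c - theta)          # quantity being estimated
    scale = ((n - 1) / n) ** 0.5                # sd of x_i given the sample mean
    se0 = se1 = 0.0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        d0 = sum(x <= c for x in xs) / n
        d1 = std_normal.cdf((c - fmean(xs)) / scale)
        se0 += (d0 - target) ** 2
        se1 += (d1 - target) ** 2
    return se0 / reps, se1 / reps

risk0, risk1 = simulate_risks(theta=0.0, c=0.5, n=10, reps=5000)
print(risk0, risk1)
```

Both rules are evaluated on the same simulated samples, which keeps the comparison stable even with a moderate number of replications.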

8.4.3 The Neyman-Pearson lemma 

We now revisit the Neyman-Pearson lemma (Neyman and Pearson 1933) from the 
point of view of complete class theorems. See Berger (1985) or Schervish (1995) for 
proofs of the theorem. 

Theorem 8.15 (Neyman-Pearson lemma) Consider a simple versus simple 
hypothesis testing problem with null hypothesis $H_0: \theta = \theta_0$ and alternative hypothesis 
$H_1: \theta = \theta_1$. The action space is $\mathcal{A} = \{a_0, a_1\}$ where $a_i$ denotes accepting hypothesis 
$H_i$ ($i = 0,1$). Assume that the loss function is the $0$-$K_i$ loss, that is a correct 
decision costs zero, while incorrectly deciding $a_i$ costs $K_i$. 

Tests of the form 

$$\delta(x) = \begin{cases}
1 & \text{if } f(x \mid \theta_1) > K f(x \mid \theta_0) \\
\gamma(x) & \text{if } f(x \mid \theta_1) = K f(x \mid \theta_0) \\
0 & \text{if } f(x \mid \theta_1) < K f(x \mid \theta_0)
\end{cases} \tag{8.5}$$

where $0 \le \gamma(x) \le 1$ if $0 < K < \infty$ and $\gamma(x) = 0$ if $K = 0$, together with the test 

$$\delta(x) = \begin{cases}
1 & \text{if } f(x \mid \theta_0) = 0 \\
0 & \text{if } f(x \mid \theta_0) > 0
\end{cases} \tag{8.6}$$

(corresponding to $K = \infty$ above), form a minimal complete class of decision rules. 
The subclass of such tests with $\gamma(x) = \gamma$ (a constant) is an essentially complete 
class. 

For any $0 \le \alpha \le \alpha^*$, there exists a test $\delta$ of the form (8.5) or (8.6) with $\alpha_0(\delta) = \alpha$ 
and any such test is a most powerful test of size $\alpha$ (that is, among all tests $\delta$ with 
$\alpha_0(\delta) \le \alpha$ such a test minimizes $\alpha_1(\delta)$). 
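
To see how the randomization $\gamma$ in (8.5) produces a test of exact size $\alpha$, the sketch below builds a likelihood ratio test for a binomial observation. It assumes $\theta_1 > \theta_0$, so that the likelihood ratio is increasing in $x$ and a threshold on the ratio is equivalent to a threshold on $x$; the function names and the numbers $n = 10$, $\theta_0 = 0.3$, $\theta_1 = 0.6$, $\alpha = 0.05$ are assumptions for illustration.

```python
from math import comb

def binom_pmf(x: int, n: int, theta: float) -> float:
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

def np_test(n: int, theta0: float, alpha: float):
    """Return (k, gamma): reject H0 when x > k, reject with probability
    gamma when x == k, so that the size under theta0 is exactly alpha.
    Assumes theta1 > theta0 (likelihood ratio increasing in x)."""
    tail = 0.0                                  # P(x > k | theta0)
    for k in range(n, -1, -1):
        p_k = binom_pmf(k, n, theta0)
        if tail + p_k > alpha:
            return k, (alpha - tail) / p_k
        tail += p_k
    return -1, 0.0                              # alpha >= 1: always reject

def reject_prob(k: int, gamma: float, n: int, theta: float) -> float:
    """P(reject H0 | theta): the size at theta0 and the power at theta1."""
    upper = sum(binom_pmf(x, n, theta) for x in range(k + 1, n + 1))
    at_k = binom_pmf(k, n, theta) if k >= 0 else 0.0
    return upper + gamma * at_k

k, gamma = np_test(10, 0.3, 0.05)
print(k, gamma)
print(reject_prob(k, gamma, 10, 0.3), reject_prob(k, gamma, 10, 0.6))
```

Without the randomization at $x = k$, only a discrete set of sizes would be attainable for a discrete observation.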


With this result we have come full circle: the Neyman-Pearson theory 
was the seed that started Wald’s statistical decision theory; minimal completeness is 
the ultimate rationality endorsement for a statistical approach within that theory—all 
and only the rules generated by the approach are worth considering. The Neyman- 
Pearson tests are a minimal complete class. Also, for each of these tests we can find 
a prior for which that test is the formal Bayes rule. What is left to argue about? 

Before you leave the theater with the impression that a boring happy ending is 
on its way, it is time to start looking at some of the results that have been, and are 
still, generating controversy. We start with a simple one in the next section, and then 
devote the entire next chapter to a more complicated one. 


8.5 Using the same $\alpha$ level across studies with different 
sample sizes is inadmissible 

An example of how the principle of admissibility can be a guide in evaluating the
rigor of common statistical procedures arises in hypothesis testing. It is common
practice to use a type I error probability $\alpha$, say 0.05, across a variety of applications
and studies. For example, most users of statistical methodologies are prepared to use
$\alpha = 0.05$ irrespective of the sample size of the study. In this section we work out a
simple example that shows that using the same $\alpha$ in two studies with different sample
sizes, all other study characteristics being the same, results in an inadmissible
decision rule.

Suppose we are interested in studying two different drugs, A and B. Efficacy
is measured by parameters $\theta_A$ and $\theta_B$. To seek approval, we wish to test both $H_{0A}$:
$\theta_A = 0$ versus $H_{1A}$: $\theta_A = \theta_1$ and $H_{0B}$: $\theta_B = 0$ versus $H_{1B}$: $\theta_B = \theta_1$ based on observations
sampled from populations $f(x|\theta_A)$ and $f(x|\theta_B)$ that differ only in the value of the
parameters. Suppose also that the samples available from the two populations are of
sizes $n_A < n_B$.

The action space is the set of four combinations of accepting or rejecting the two 
hypotheses. Suppose the loss structure for this decision problem is the sum of the 
two drug specific loss tables 



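\[
\begin{array}{c|cc}
 & \theta_A = 0 & \theta_A = \theta_1 \\ \hline
a_{0A} & 0 & L_0 \\
a_{1A} & L_1 & 0
\end{array}
\qquad
\begin{array}{c|cc}
 & \theta_B = 0 & \theta_B = \theta_1 \\ \hline
a_{0B} & 0 & L_0 \\
a_{1B} & L_1 & 0
\end{array}
\]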


where $L_0$ and $L_1$ are the same for both drugs. This formulation is similar to the multiple
hypothesis testing setting of Section 7.5.2. Adding the losses is natural if the
consequences of the decisions apply to the same company, and also makes sense from
the point of view of a regulatory agency that controls the drug approval process.

The risk function for any decision function is defined over four possible combinations
of values of $\theta_A$ and $\theta_B$. Consider the space of decision functions $\delta =
(\delta_A(x_A), \delta_B(x_B))$. These rules use the two studies separately in choosing an action.
In the notation of Section 7.5.1, $\alpha_\delta$ and $\beta_\delta$ are the probabilities of type I and II errors
associated with decision rule $\delta$. So the generic decision rule will have risk function


\[
\begin{array}{cc|l}
\theta_A & \theta_B & R(\theta, \delta) \\ \hline
0 & 0 & L_1 \alpha_{\delta_A} + L_1 \alpha_{\delta_B} \\
0 & 1 & L_1 \alpha_{\delta_A} + L_0 \beta_{\delta_B} \\
1 & 0 & L_0 \beta_{\delta_A} + L_1 \alpha_{\delta_B} \\
1 & 1 & L_0 \beta_{\delta_A} + L_0 \beta_{\delta_B}
\end{array}
\]

Consider now the decision rule that specifies a fixed value of $\alpha$, say 0.05, and
selects the rejection region to minimize $\beta$ subject to the constraint that $\alpha$ is no more
than 0.05. Define $\beta(\alpha, n)$ to be the resulting type II error. These rules are admissible
in the single-study setting of Section 7.5.1. Here, however, it turns out that using the
same $\alpha$ in both $\delta_A(x_A)$ and $\delta_B(x_B)$ is often inadmissible. The reason is that the two
studies have the same loss structure. As we have seen, by specifying $\alpha$ one implicitly
specifies a trade-off between type I and type II error, which in decision-theoretic
terms is represented by $L_1/L_0$. But this implicit relationship depends on the sample
size. In this formulation we have an additional opportunity to trade off errors with
one drug for errors in the other to beat the fixed $\alpha$ strategy no matter what the true
parameters are.

Taking the rule with fixed $\alpha$ as the starting point, one is often able to decrease
$\alpha$ in the larger study, and increase it in the smaller study, to generate a dominating
decision rule. A simple example to study is one where the decrease and the increase
cancel each other out, in which case the overall type I error (the risk when both drugs
are null) remains unchanged. Specifically, call $\delta$ the fixed $\alpha = \alpha_0$ rule, and $\delta'$ the rule obtained
by choosing $\alpha_0 + \epsilon$ in study A and $\alpha_0 - \epsilon$ in study B, and then choosing rejection
regions by minimizing type II error within each study. The risk functions are


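\[
\begin{array}{cc|l|l}
\theta_A & \theta_B & R(\theta, \delta) & R(\theta, \delta') \\ \hline
0 & 0 & 2 L_1 \alpha_0 & 2 L_1 \alpha_0 \\
0 & 1 & L_1 \alpha_0 + L_0 \beta(\alpha_0, n_A) & L_1 (\alpha_0 + \epsilon) + L_0 \beta(\alpha_0 + \epsilon, n_A) \\
1 & 0 & L_1 \alpha_0 + L_0 \beta(\alpha_0, n_B) & L_1 (\alpha_0 - \epsilon) + L_0 \beta(\alpha_0 - \epsilon, n_B) \\
1 & 1 & L_0 (\beta(\alpha_0, n_A) + \beta(\alpha_0, n_B)) & L_0 (\beta(\alpha_0 + \epsilon, n_A) + \beta(\alpha_0 - \epsilon, n_B))
\end{array}
\]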


If the risk of $\delta'$ is strictly less than that of $\delta$ in the (0,1) and (1,0) states, then
it turns out that it will also be less in the (1,1) state. So, rearranging, sufficient
conditions for $\delta'$ to dominate $\delta$ are

\[
\frac{1}{\epsilon}\left(\beta(\alpha_0, n_A) - \beta(\alpha_0 + \epsilon, n_A)\right) > \frac{L_1}{L_0}
\]

\[
\frac{1}{\epsilon}\left(\beta(\alpha_0, n_B) - \beta(\alpha_0 - \epsilon, n_B)\right) > -\frac{L_1}{L_0}.
\]

These conditions are often met. For example, say the two populations are
$N(\theta_A, \sigma^2)$ and $N(\theta_B, \sigma^2)$, where the variance $\sigma^2$, to keep things simple, is known.
Then, within each study

\[
\beta(\alpha, n) = \Phi\left(\Phi^{-1}(1 - \alpha) - \sqrt{n}\,\frac{\theta_1}{\sigma}\right).
\]

When $\theta_1/\sigma = 1$, $n_A = 5$, and $n_B = 20$, choosing $\epsilon = 0.001$ gives

\[
3.213 > \frac{L_1}{L_0}
\]

\[
-0.073 > -\frac{L_1}{L_0}
\]

which is satisfied over a useful range of loss specifications.
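
The two numerical bounds in the normal example can be checked with a few lines of code. The following is a minimal sketch, assuming a one-sided test with known variance so that the type II error has the closed form quoted in the text; the function names `beta` and `dominance_bounds` are hypothetical and introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def beta(alpha, n, effect=1.0):
    """Type II error of the most powerful size-alpha test with n observations.

    `effect` stands for theta_1 / sigma (an illustrative parameter name).
    """
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(z - sqrt(n) * effect)


def dominance_bounds(alpha0, eps, n_a, n_b, effect=1.0):
    """Left-hand sides of the two sufficient conditions for dominance."""
    lhs_a = (beta(alpha0, n_a, effect) - beta(alpha0 + eps, n_a, effect)) / eps
    lhs_b = (beta(alpha0, n_b, effect) - beta(alpha0 - eps, n_b, effect)) / eps
    return lhs_a, lhs_b


lhs_a, lhs_b = dominance_bounds(alpha0=0.05, eps=0.001, n_a=5, n_b=20)
# Both conditions hold when -lhs_b < L1/L0 < lhs_a.
print(f"{-lhs_b:.3f} < L1/L0 < {lhs_a:.3f}")
```

Changing `n_a`, `n_b`, or `eps` shows how the admissible range of the loss ratio moves with the two sample sizes.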

Berry and Viele (2008) consider a related admissibility issue framed in the context
of a random sample size, and show that choosing $\alpha$ ahead of the study, and then
choosing the rejection region by minimizing type II error given $n$ after the study is
completed, leads to rules that are inadmissible. In this case the risk function must be
computed by treating $n$ as an unknown experimental outcome. The mathematics has
similarities with the example shown here, except there is a single hypothesis, so only
the (0,0) and (1,1) cases are relevant. In the single study case, the expected utility
solution specifies, for a fixed prior and loss ratio, how the type I error probability
should vary with $n$. This is discussed, for example, by Seidenfeld et al. (1990b), who
show that for an expected utility maximizer, $d\alpha_n/dn$ must be constant with $n$. Berry
and Viele (2008) show how varying $\alpha$ in this way can dominate the fixed $\alpha$ rule. See
also Problem 8.5.

In Chapter 9, we will discuss borrowing strength across a battery of related problems,
and we will encounter estimators that are admissible in single studies but not
when the ensemble is considered and losses are added. The example we just discussed
is similar in that a key is the additivity of losses, which allows us to trade off
errors in one problem with errors in another. An important difference, however, is
that here we use only data from study A to decide about drug A, and likewise for B,
while in Chapter 9 we will also use the ensemble of the data in determining each
estimate.


8.6 Exercises 


Problem 8.1 (Schervish 1995) Let $\Theta = (0, \infty)$, $\mathcal{A} = [0, \infty)$, and loss function
$L(\theta, a) = (\theta - a)^2$. Suppose also that $x \sim U(0, \theta)$, given $\theta$, and that the prior for
$\theta$ is the $U(0, c)$ distribution for $c > 0$. Find two formal Bayes rules one of which
dominates the other. Try your hand at it before you look at the solution below. What is
the connection between this example and the Likelihood Principle? If $x < c$, should
it matter what the rule would have said in the case $x > c$? What is the connection
with the discussion we had in the introduction about judging an answer by what it
says versus judging it by how it was derived?


Solution 

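After observing $x < c$, the posterior distribution is

\[
\pi(\theta|x) = \frac{1}{\theta \log(c/x)}\, I_{(x,c)}(\theta)
\]

while if $x \ge c$ the posterior distribution is defined arbitrarily. So the Bayes rules are
of the form

\[
\delta^*(x) =
\begin{cases}
\dfrac{c - x}{\log(c/x)} & \text{if } x < c \\
\text{arbitrary} & \text{if } x \ge c
\end{cases}
\]

which is the posterior mean, if $x < c$. Let $\delta_0^*(x)$ denote the Bayes rule which has
$\delta_0^*(x) = c$ for $x \ge c$. Similarly, define the Bayes rule $\delta_1^*(x)$ which has $\delta_1^*(x) = x$ for
$x \ge c$. For $\theta > c$, the difference in risk functions is

\[
R(\theta, \delta_0^*) - R(\theta, \delta_1^*) = \int_c^\theta \left[(\theta - c)^2 - (\theta - x)^2\right] \frac{1}{\theta}\, dx > 0.
\]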

Problem 8.2 Suppose $T$ is a sufficient statistic for $\theta$ and let $\delta_0^*$ be any randomized
decision rule in $\mathcal{D}$. Prove that, subject to measurability conditions, there exists a
randomized decision rule $\delta_1^*$, depending on $T$ only, which is $R$-equivalent to $\delta_0^*$.

Problem 8.3 Prove that if a minimal complete class $\mathcal{D}$ exists, then it is exactly the
class of all admissible estimators.

Problem 8.4 Prove Theorem 8.5. 

Problem 8.5 (Based on Cox 1958) Consider testing $H_0: \theta = 0$ based on a single
observation $x$. The probabilistic mechanism generating $x$ is this. Flip a coin: if head
draw an observation from a $N(\theta, \sigma = 1)$; if tail from a $N(\theta, \sigma = 10)$. The outcome
of the coin is known. For a concrete example, imagine randomly drawing one of two
measuring devices, one more precise than the other.

Decision rule $\delta_c$ (a conditional test) is to fix $\alpha = 0.05$ and then reject $H_0$ if $x >
1.64$ when $\sigma = 1$ and $x > 16.4$ when $\sigma = 10$. An unconditional test is allowed to
select the two rejection regions based on properties of both devices. Show that the
uniformly most powerful test of level $\alpha = 0.05$ dominates $\delta_c$.

There are at least two possible interpretations for this result. The first says that 
the decision-theoretic approach, as implemented by considering frequentist risk, is 
problematic in that it can lead to violating the Likelihood Principle. The second 
argues that the conditional test is not a rational one in the first place, because it 
effectively implies using a different loss function depending on the measuring device 
used. Which do you favor? Can they both be right? 

Problem 8.6 Let $x$ be the number of successes in $n$ independent trials with probability
of success $\theta \in (0, 1)$. Show that a rule is admissible for the squared error
loss

\[
L(\theta, a) = (a - \theta)^2
\]

if and only if it is admissible for the “standardized” squared error loss

\[
L_s(\theta, a) = \frac{(a - \theta)^2}{\theta(1 - \theta)}.
\]

Is this property special to the binomial case? To the squared error loss? State
and prove a general version of this result. The more general it is, the better, of
course.


Problem 8.7 Take $x \sim N(\theta, 1)$, $\theta \sim N(0, 1)$, and

\[
L(\theta, a) = (\theta - a)^2 e^{3\theta^2/4}.
\]

Show that the formal Bayes rule is $\delta^*(x) = 2x$ and that it is inadmissible. The reason
things get weird here is that the Bayes risk $r$ is infinite. It will be easier to think of a
rule that dominates $2x$ if you work on Problem 8.6 first.


Problem 8.8 Prove Theorem 8.2.

Problem 8.9 Let $x$ be the number of successes in $n = 2$ independent trials with
probability of success $\theta \in [0, 1]$. Consider estimating $\theta$ with squared error loss. Say
the prior is a point mass at 0. Show that the rule

\[
\delta^*(0) = 0, \quad \delta^*(1) = 1, \quad \delta^*(2) = 0
\]

is a Bayes rule. Moreover, show that it is not the unique Bayes rule, and it is not
admissible.

Problem 8.10 If $\pi$ gives positive mass to all open sets and the risk function is
continuous, then the Bayes rule with respect to this prior is admissible.

Problem 8.11 Prove that if $\delta^M$ has constant risk and it is admissible, then $\delta^M$ is the
unique minimax rule.


Shrinkage


In this chapter we present a famous result due to Stein (1955) and further elaborated
by James & Stein (1961). It concerns estimating several parameters as part of the
same decision problem. Let $x = (x_1, \ldots, x_p)'$ be distributed according to a multivariate
p-dimensional $N(\theta, I)$ distribution, and assume the multivariate quadratic loss
function

\[
L(\theta, a) = \sum_{i=1}^p (a_i - \theta_i)^2.
\]

The estimator $\delta(x) = x$ is the maximum likelihood estimator of $\theta$, it is unbiased,
and it has the smallest risk among all unbiased estimators. This would make it the
almost perfect candidate from a frequentist angle. It is also the formal Bayes rule if
one wishes to specify an improper flat prior on $\theta$. However, Stein showed that such
an estimator is not admissible. Oops.

Robert describes the aftermath as follows: 

One of the major impacts of the Stein paradox is to signify the end of 
a “Golden Age” for classical statistics, since it shows that the quest 
for the best estimator, i.e., the unique minimax admissible estimator, is 
hopeless, unless one restricts the class of estimators to be considered or 
incorporates some prior information. ... its main consequence has been 
to reinforce the Bayesian-frequentist interface, by inducing frequentists 
to call for Bayesian techniques and Bayesians to robustify their estimators
in terms of frequentist performances and prior uncertainty. (Robert
1994, p. 67) 


Decision Theory: Principles and Approaches G. Parmigiani, L. Y. T. Inoue
© 2009 John Wiley & Sons, Ltd

Efron adds:


The implications for objective Bayesians and fiducialists have been 
especially disturbing.... If a satisfactory theory of objective Bayesian 
inference exists, Stein’s estimator shows that it must be a great deal more 
subtle than previously expected. (Efron 1978, p. 244) 


The Stein effect occurs even though the dimensions are unrelated, and it can be
explained primarily through the joint loss $L(\theta, a) = \sum_{i=1}^p (\theta_i - a_i)^2$, which allows the
dominating estimator to “borrow strength” across dimensions, and bring individual
estimates closer together, trading off errors in one dimension with those in another.
This is usually referred to as shrinkage.

Our goal here is to build some intuition for the main results, and convey a sense 
for which approaches weather the storm in reasonable shape. In the end these include 
Bayes, but also empirical Bayes and some varieties of minimax. We first present the 
James-Stein theorem, straight up, no chaser, in the simplest form we know. We then 
look at some of the most popular attempts at intuitive explanations, and finally turn 
to more general formulations, both Bayes and minimax. 

Featured article: 

Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution, Proceedings of the Third Berkeley Symposium on
Mathematical Statistics and Probability 1: 197-206.


The literature on shrinkage, its justifications and ramifications, is vast. For a more 
detailed discussion we refer to Berger (1985), Brandwein and Strawderman (1990), 
Robert (1994), and Schervish (1995) and the references therein. 


9.1 The Stein effect 

Theorem 9.1 Suppose that a p-dimensional vector $x = (x_1, x_2, \ldots, x_p)'$ follows a
multivariate normal distribution with mean vector $\theta = (\theta_1, \ldots, \theta_p)'$ and the identity
covariance matrix $I$. Let $\mathcal{A} = \Theta = \mathbb{R}^p$, and let the loss be $L(\theta, a) = \sum_{i=1}^p (\theta_i - a_i)^2$.
Then, for $p > 2$, the decision rule

\[
\delta^{JS}(x) = x \left[ 1 - \frac{p - 2}{\sum_{i=1}^p x_i^2} \right]
\tag{9.1}
\]

dominates $\delta(x) = x$.

Proof: The risk function of $\delta$ is

\[
\begin{aligned}
R(\theta, \delta) &= \int_{\mathbb{R}^p} L(\theta, \delta(x)) f(x|\theta)\, dx \\
&= \sum_{i=1}^p \int_{-\infty}^{\infty} (\theta_i - x_i)^2 f(x_i|\theta_i)\, dx_i \\
&= \sum_{i=1}^p Var[x_i|\theta_i] = p.
\end{aligned}
\]

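To compute the risk function of $\delta^{JS}$ we need two expressions:

\[
E_{x|\theta}\left[ \frac{\sum_{i=1}^p x_i \theta_i}{\sum_{i=1}^p x_i^2} \right]
= E_{y|\theta}\left[ \frac{2y}{p - 2 + 2y} \right]
\tag{9.2}
\]

\[
E_{x|\theta}\left[ \frac{1}{\sum_{i=1}^p x_i^2} \right]
= E_{y|\theta}\left[ \frac{1}{p - 2 + 2y} \right]
\tag{9.3}
\]

where $y$ follows a Poisson distribution with mean $\sum_{i=1}^p \theta_i^2/2$. We will prove these
results later. First, let us see how these can be used to get us to the finish line:

\[
\begin{aligned}
R(\theta, \delta^{JS}) &= \int_{\mathbb{R}^p} L(\theta, \delta^{JS}(x)) f(x|\theta)\, dx \\
&= E_{x|\theta}\left[ \sum_{i=1}^p \left( x_i - \theta_i - \frac{(p-2)\, x_i}{\sum_{j=1}^p x_j^2} \right)^2 \right] \\
&= p - 2(p-2)\, E_{x|\theta}\left[ \frac{\sum_{i=1}^p x_i (x_i - \theta_i)}{\sum_{j=1}^p x_j^2} \right]
   + (p-2)^2\, E_{x|\theta}\left[ \frac{1}{\sum_{j=1}^p x_j^2} \right] \\
&= p - 2(p-2)\, E_{y|\theta}\left[ \frac{p-2}{p - 2 + 2y} \right]
   + (p-2)^2\, E_{y|\theta}\left[ \frac{1}{p - 2 + 2y} \right] \\
&= p - E_{y|\theta}\left[ \frac{(p-2)^2}{p - 2 + 2y} \right] \\
&< p = R(\theta, \delta)
\end{aligned}
\]

and the theorem is proved.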

Now on to the two expectations in equations (9.2) and (9.3). See also James and
Stein (1961), Baranchik (1973), Arnold (1981), or Schervish (1995, pp. 163-167).

We will first prove equation (9.3). A random variable $U$ with a noncentral chi-square
distribution with $k$ degrees of freedom and noncentrality parameter $\lambda$ is a mixture
with conditional distributions $U|Y = y$ having a central chi-square distribution with
$k + 2y$ degrees of freedom and where the mixing variable $Y$ has a Poisson distribution
with mean $\lambda$. In our problem, $U = \sum_{i=1}^p x_i^2$ has a noncentral chi-square distribution
with $p$ degrees of freedom and noncentrality parameter equal to $\sum_{i=1}^p \theta_i^2/2$. Using the
mixture representation of the noncentral chi-square, with $Y \sim \text{Poisson}(\sum_{i=1}^p \theta_i^2/2)$
and $U|Y = y \sim \chi^2_{p+2y}$, we get

\[
E_{x|\theta}\left[ \frac{1}{\sum_{i=1}^p x_i^2} \right]
= E\left[ \frac{1}{U} \right]
= E_{y|\theta}\left[ E\left[ \frac{1}{U} \,\Big|\, Y = y \right] \right]
= E_{y|\theta}\left[ \frac{1}{p + 2y - 2} \right].
\]

The last equality is based on the expression for the mean of a central inverse chi-square
distribution.

Next, to prove equation (9.2) we will use the following result:

The proof is left as an exercise (Problem 9.4). Now,

The assumption of unit variance is convenient but not required: any known
p-dimensional vector of variances would lead to the same result as we can standardize
the observations. For the same reason, as long as the variance(s) are known,
$x$ itself could be a vector of sample means.

The estimator $\delta^{JS}$ in Theorem 9.1 is the James-Stein estimator. It is a shrinkage
estimator in that, in general, the individual components of $\delta^{JS}$ are closer to 0 than the
corresponding components of $x$, so the p-dimensional vector of observations is said
to be shrunk towards 0.
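
The dominance result can be illustrated numerically. The sketch below is only an illustration under the assumptions of Theorem 9.1 (identity covariance, $p > 2$): it evaluates the risk $p - (p-2)^2 E[1/(p - 2 + 2y)]$ by summing the Poisson series and compares it with a Monte Carlo average of the quadratic loss. The function names, the choice $p = 10$, the seed, and the number of simulations are all assumptions made for the example.

```python
import math
import random


def james_stein(x):
    """Return x * (1 - (p - 2) / x'x), the estimator in equation (9.1)."""
    p = len(x)
    shrink = 1.0 - (p - 2) / sum(xi * xi for xi in x)
    return [shrink * xi for xi in x]


def js_risk_series(theta, tol=1e-12):
    """Risk p - (p - 2)^2 E[1 / (p - 2 + 2Y)], Y ~ Poisson(theta'theta / 2).

    Requires p > 2; assumes theta'theta is moderate so exp(-lam) does not underflow.
    """
    p = len(theta)
    lam = sum(t * t for t in theta) / 2.0
    prob = math.exp(-lam)  # P(Y = 0)
    expectation, y = 0.0, 0
    # Accumulate Poisson terms past the mode until they become negligible.
    while y <= lam or prob > tol:
        expectation += prob / (p - 2 + 2 * y)
        y += 1
        prob *= lam / y
    return p - (p - 2) ** 2 * expectation


def js_risk_simulated(theta, n_sims=20000, seed=0):
    """Monte Carlo average of the quadratic loss of the James-Stein estimator."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(t, 1.0) for t in theta]
        total += sum((d - t) ** 2 for d, t in zip(james_stein(x), theta))
    return total / n_sims


theta = [1.0] * 10
print("series risk:   ", js_risk_series(theta))
print("simulated risk:", js_risk_simulated(theta))
print("risk of x:     ", len(theta))
```

At $\theta = 0$ the Poisson variable is degenerate at zero, so the series reduces to $p - (p - 2) = 2$ whatever the dimension, which is where the gain over $x$ is largest.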

How does the James-Stein estimator itself behave under the admissibility criterion?
It turns out that it too can be dominated. For example, the so-called truncated
James-Stein estimator

\[
\delta^{JS}_+(x) = x \left[ 1 - \frac{p - 2}{\sum_{i=1}^p x_i^2} \right]_+
\tag{9.5}
\]

is a variant that never switches the sign of an observation, and does dominate $\delta^{JS}$.
Here the subscript $+$ denotes the positive part of the expression in the square brackets.
Put more simply, when the stuff in these brackets is negative we replace it with
0. This estimator is itself inadmissible. One way to see this is to use a complete
class result due to Sacks (1963), who showed that any admissible estimate can be
represented as a formal Bayes rule of the form

\[
\delta(x) = \frac{\int \theta\, e^{-(x - \theta)'(x - \theta)/2}\, \pi(d\theta)}{\int e^{-(x - \theta)'(x - \theta)/2}\, \pi(d\theta)}.
\]

These estimators are analytic in $x$ while the positive part estimator is continuous
but not differentiable. Baranchik (1970) also provided an estimator which dominates
the James-Stein estimator. The James-Stein estimator is important for having
revealed a weakness of standard approaches and for having pointed to shrinkage
as an important direction for improvement. However, direct applications have not
been numerous, perhaps because it is not difficult to find more intuitive, or better
performing, implementations.


9.2 Geometric and empirical Bayes heuristics 

9.2.1 Is $x$ too big for $\theta$?

An argument that is often heard as “intuitive” support for the need to shrink is captured
by Figure 9.1. Stein would mention it in his lectures, and many have reported
it since. The story goes more or less like this. In the setting of Theorem 9.1, because
$E_{x|\theta}[(x - \theta)'\theta] = 0$, we expect orthogonality between $x - \theta$ and $\theta$. Moreover, because
$E_{x|\theta}[x'x] = p + \theta'\theta$, it may appear that $x$ is too big as an estimator of $\theta$ and thus that
it may help to shorten it. For example, the projection of $\theta$ on $x$ might be closer. However,
the projection $(1 - a)x$ depends on $\theta$ through $a$. Therefore, we need to estimate
$a$. Write $y = \theta - (1 - a)x$ for the residual of the projection. Assuming that (i) the angle
between $\theta$ and $x - \theta$ is right, (ii) $x'x$ is close to its
expected value, $\theta'\theta + p$, and (iii) $(x - \theta)'(x - \theta)$ is close to its expected value $p$, then
using Pythagoras’s theorem in the right subtriangles in Figure 9.1 we obtain

\[
\begin{aligned}
y'y &= (x - \theta)'(x - \theta) - a^2 x'x \\
    &= p - a^2 x'x
\end{aligned}
\]

and

\[
\begin{aligned}
y'y &= \theta'\theta - (1 - a)^2 x'x \\
    &= x'x - p - (1 - a)^2 x'x.
\end{aligned}
\]

By equating the above expressions, $\hat{a} = p/x'x$ is an estimate of $a$. Thus, an estimator
for $\theta$ would be

\[
\delta(x) = (1 - \hat{a})\, x = \left( 1 - \frac{p}{x'x} \right) x
\tag{9.6}
\]

similar to the James-Stein estimator introduced in Section 9.1.

Figure 9.1 Geometric heuristic for the need to shrink. Vectors here live in
p-dimensional spaces. Adapted with minor changes from Brandwein and Strawderman
(1990).

While this heuristic does not have any pretense of rigor, it is suggestive. However,
a doubt remains about how much insight can be gleaned from the emphasis placed
on the origin as the shrinkage point. For example, Efron (1978) points out that if one
specifies an arbitrary origin $x_0$ and defines the estimator

\[
\delta_{x_0}(x) = x_0 + (x - x_0) \left[ 1 - \frac{p - 2}{(x - x_0)'(x - x_0)} \right]
\tag{9.7}
\]

to shrink towards $x_0$ instead of 0, this estimator also dominates $x$. Berger (1985)
has details and references for various implementations of this idea. For many choices
of $x_0$ the heuristic we described breaks down, but the Stein effect still works.

One way in which the insight of Figure 9.1 is sometimes summarized is by saying
that “$\theta$ is closer to 0 than $x$.” Is this reasonable? Maybe. But then shrinkage
works for every $x_0$. It certainly seems harder to claim that “$\theta$ is closer to $x_0$ than
$x$” irrespective of what $x_0$ is. Similarly, some skepticism is probably useful when
considering heuristics based on the fact that the probability that $x'x > \theta'\theta$ can be
quite large even for relatively small $p$ and large $\theta$. Of course this is true, but it is
less clear whether this is why shrinkage works. For more food for thought, observe
that, overall, shrinkage is more pronounced when the data $x$ are closer to the origin.
Perhaps a more useful perspective is that of “borrowing strength” across dimensions,
and shrinking when dimensions look similar. This is made formal in the Bayes and
empirical Bayes approaches considered next.

9.2.2 Empirical Bayes shrinkage 

The idea behind empirical Bayes approaches is to estimate the prior empirically. 
When one has a single estimation problem this is generally quite hard, but when 
a battery of problems are considered together, as is the case in this chapter, this 
is possible, and often quite useful. The earliest example is due to Robbins (1956) 
while Efron and Morris highlighted the relevance of this approach for shrinkage and 
for the so-called hierarchical models (Efron and Morris 1973b, Efron and Morris 
1973a). 

A relatively general formulation could go as follows. The p-dimensional observation
vector $x$ has distribution $f(x|\theta)$, while parameters $\theta$ have distribution $\pi(\theta|\tau)$
with $\tau$ unknown and of dimension generally much lower than $p$. A classic example
is one where $x_i$ represents a noisy measurement of $\theta_i$, say, the observed and
real weight of a tree. The distribution $\pi(\theta|\tau)$ describes the variation of the unobserved
true measurements across the population. This model is an example of a
multilevel model, with one level representing the noise in the measurements and the
other the population variation. Depending on the study design, the $\theta$ may be called
random effects. Multilevel models are now a mainstay of applied statistics and the
primary venue for shrinkage in practical applications (Congdon 2001, Ferreira and
Lee 2007).

If interest is in $\theta$, an empirical Bayes estimator can be obtained by deriving the
marginal likelihood

\[
m(x|\tau) = \int f(x|\theta)\, \pi(\theta|\tau)\, d\theta
\]

and using it to identify a reasonable estimator of $\tau$, say $\hat{\tau}(x)$. This estimator is then
plugged back into $\pi$, so that a prior is now available for use, via the Bayes rule, to
obtain the Bayes estimator for $\theta$. Under squared error loss this is

\[
\delta^{EB}(x) = E[\theta \,|\, x, \hat{\tau}(x)].
\]

An empirical Bayes analysis does not correspond to a coherent Bayesian updating,
since the data are used twice, but it has nice properties that contributed to its
wide use.

While we use the letter $\pi$ for the distribution of $\theta$ given $\tau$, this distribution is a
somewhat intermediate creation between a prior and a likelihood: its interpretation
can vary significantly with the context, and it can be empirically estimable in some
problems. A Bayesian analysis of this model would also assign a prior to $\tau$, and
proceed with coherent updating and expected loss minimization as in Theorem 7.1.
If $p$ is large and the likelihood is more concentrated than the prior, that is the noise
is small compared to the variation across the population, the Bayes and empirical
Bayes estimators of $\theta$ will be close.

Returning to our p-dimensional vector $x$ from a multivariate normal distribution
with mean vector $\theta$ and covariance matrix $I$, we can derive easily an empirical Bayes
estimator as follows. Assume that a priori $\theta \sim N(0, \tau_0^2 I)$. The Bayes estimator of $\theta$,
under a quadratic loss function, is the posterior mean, that is

\[
\delta^*_{\tau_0}(x) = \frac{1}{1 + \tau_0^{-2}}\, x = \left( 1 - \frac{1}{1 + \tau_0^2} \right) x
\tag{9.8}
\]

assuming $\tau_0^2$ is fixed. The empirical Bayes estimator of $\theta$ is found as follows. First,
we find the unconditional distribution of $x$ which is, in this case, normal with mean
0 and variance $(1 + \tau_0^2) I$. Second, we find a reasonable estimator for $\tau_0$. Actually, in
this case, it makes sense to aim directly at the shrinkage factor $(1 + \tau_0^2)^{-1}$. Because

\[
\sum_{i=1}^p x_i^2 \sim (1 + \tau_0^2)\, \chi^2_p
\]

and

\[
E\left[ \frac{p - 2}{\sum_{i=1}^p x_i^2} \right] = (1 + \tau_0^2)^{-1}
\]

then $(p - 2)/\sum_{i=1}^p x_i^2$ is an unbiased estimator of $1/(1 + \tau_0^2)$. Plugging this directly into
equation (9.8) gives the empirical Bayes estimator

\[
\delta^{EB}(x) = \left( 1 - \frac{p - 2}{\sum_{i=1}^p x_i^2} \right) x.
\tag{9.9}
\]

This is the James-Stein estimator! We can reinterpret it as a weighted average 
between a prior mean of 0 and an observed measurement x, with weights that are 
learned completely from the data. The prior mean does not have to be zero: in fact 
versions of this where one shrinks towards the empirical average of x can also be 
shown to dominate the MLE, and to have an empirical Bayes justification. 
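
The claim that the weights are learned from the data rests on $(p - 2)/\sum_i x_i^2$ being unbiased for the shrinkage factor $1/(1 + \tau_0^2)$. The following is a small simulation sketch of that fact, not a definitive implementation: it draws $x$ from its marginal distribution, normal with mean 0 and variance $1 + \tau_0^2$ in each coordinate, and averages the estimated factor. The function names, $p = 10$, $\tau_0 = 2$, and the seed are illustrative assumptions.

```python
import random


def estimated_shrinkage_factor(x):
    """(p - 2) / sum(x_i^2): the plug-in estimate of 1 / (1 + tau0^2)."""
    p = len(x)
    return (p - 2) / sum(xi * xi for xi in x)


def empirical_bayes_estimate(x):
    """Shrink every coordinate of x towards the prior mean 0."""
    b = estimated_shrinkage_factor(x)
    return [(1.0 - b) * xi for xi in x]


def average_estimated_factor(p, tau0, n_sims=20000, seed=1):
    """Average of the estimated factor over draws from the marginal of x."""
    rng = random.Random(seed)
    marginal_sd = (1.0 + tau0 ** 2) ** 0.5  # marginal variance is 1 + tau0^2
    total = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, marginal_sd) for _ in range(p)]
        total += estimated_shrinkage_factor(x)
    return total / n_sims


p, tau0 = 10, 2.0
print("target factor 1/(1 + tau0^2):", 1.0 / (1.0 + tau0 ** 2))
print("average estimated factor:    ", average_estimated_factor(p, tau0))
```

A larger `tau0` (a more diffuse population of means) drives the factor towards zero, so the estimate stays close to $x$; a small `tau0` pulls it strongly towards the prior mean.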

Instead of using an unbiased estimator of the shrinkage factor, one could, alternatively,
find an estimator for $\tau_0^2$ using the maximum likelihood approach, which
leads to

\[
\hat{\tau}^2 =
\begin{cases}
\dfrac{\sum_{i=1}^p x_i^2}{p} - 1 & \text{if } \sum_{i=1}^p x_i^2 > p \\
0 & \text{otherwise}
\end{cases}
\tag{9.10}
\]

and the corresponding empirical Bayes estimator is

\[
\delta^{EB}(x) = \frac{\hat{\tau}^2\, x}{1 + \hat{\tau}^2}
\tag{9.11}
\]

\[
= x \left[ 1 - \frac{p}{\sum_{i=1}^p x_i^2} \right]_+
\tag{9.12}
\]

similar to the estimator of equation (9.5).

9.3 General shrinkage functions 

9.3.1 Unbiased estimation of the risk of x + g(x) 

Stein developed a beautiful way of providing insight into the subtle issues raised in
this chapter, by setting very general conditions for estimators of the form $\delta_g(x) =
x + g(x)$ to dominate $\delta(x) = x$. We begin with two preparatory lemmas. All the
results in this section are under the assumptions of Theorem 9.1.


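Lemma 1 Let $y$ be distributed as a $N(\theta, 1)$. Let $h$ be the indefinite integral of a
measurable function $h'$ such that

\[
\lim_{y \to \pm\infty} h(y) \exp\left( -\frac{1}{2}(y - \theta)^2 \right) = 0.
\tag{9.13}
\]

Then

\[
E_{y|\theta}[h(y)(y - \theta)] = E_{y|\theta}[h'(y)].
\tag{9.14}
\]

Proof: Starting from the left hand side,

\[
\begin{aligned}
E_{y|\theta}[h(y)(y - \theta)]
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} h(y)(y - \theta) \exp\left( -\frac{(y - \theta)^2}{2} \right) dy \\
&= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} h(y)\, \frac{d}{dy}\left[ \exp\left( -\frac{(y - \theta)^2}{2} \right) \right] dy \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} h'(y) \exp\left( -\frac{(y - \theta)^2}{2} \right) dy \\
&= E_{y|\theta}[h'(y)].
\end{aligned}
\tag{9.15}
\]

Assumption (9.13) is used in the integration by parts, where it makes the boundary
term vanish. □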


Weaker versions of assumption (9.13) are used in Stein (1981). Incidentally, the 
converse of this lemma is also true, so if equation (9.14) holds for all reasonable h, 
then y must be normal. 

The next lemma cranks this result up to $p$ dimensions, and looks at the covariance
between the error $x - \theta$ and the shrinkage function $g(x)$. It requires a technical
differentiability condition and some more notation: a function $h: \mathbb{R}^p \to \mathbb{R}$ is almost
differentiable if there exists a function $\nabla h: \mathbb{R}^p \to \mathbb{R}^p$ such that

\[
h(x + z) - h(x) = \int_0^1 z' \nabla h(x + tz)\, dt,
\]

where $z$ is in $\mathbb{R}^p$ and $\nabla$ can also be thought of as the vector differential operator

\[
\nabla = \left( \frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_p} \right)'.
\]

A function $g: \mathbb{R}^p \to \mathbb{R}^p$ is almost differentiable if every coordinate is.

Lemma 2 If $h$ is almost differentiable and $E_{x|\theta}[(\nabla h(x))'(\nabla h(x))] < \infty$, then

\[
E_{x|\theta}[h(x)(x - \theta)] = E_{x|\theta}[\nabla h(x)].
\tag{9.16}
\]

Using these two lemmas, the next theorem provides a usable closed form for the
risk of estimators of the form $\delta_g(x) = x + g(x)$. Look at Stein (1981) for proofs.

Theorem 9.2 (Unbiased estimation of risk) If $g: \mathbb{R}^p \to \mathbb{R}^p$ is almost differentiable
and such that

\[
E_{x|\theta}\left[ \sum_{i=1}^p |\nabla_i g_i(x)| \right] < \infty
\]

then

\[
E_{x|\theta}\left[ (x + g(x) - \theta)'(x + g(x) - \theta) \right]
= p + E_{x|\theta}\left[ (g(x))'(g(x)) + 2 \sum_{i=1}^p \nabla_i g_i(x) \right].
\tag{9.17}
\]


The risk is decomposable into the risk of the estimator $x$ plus a component that
depends on the shrinkage function $g$. A corollary of this result, and the key of our
exercise, is that if we can specify $g$ so that

\[
(g(x))'(g(x)) + 2 \sum_{i=1}^p \nabla_i g_i(x) \le 0
\tag{9.18}
\]

for all values of $x$ (with at least a strict inequality somewhere), then $\delta_g(x)$ dominates
$\delta(x) = x$. This argument encapsulates the bias-variance trade-off in shrinkage estimators:
shrinkage works when the negative covariance between errors $x - \theta$ and
corrections $g(x)$ more than offsets the bias induced by $g(x)$. Because of the additivity
of the losses across dimensions, we can trade off bias in one dimension for variance
in another. This aspect more than any other sets the Stein estimation setting apart
from problems with a single parameter of interest.

Before we move to discussing ways to choose $g$, note that the right hand side of
equation (9.17) is the risk of $\delta_g$, so the quantity

\[
p + (g(x))'(g(x)) + 2 \sum_{i=1}^p \nabla_i g_i(x)
\]

is an unbiased estimator of the unknown $R(\theta, \delta_g)$. The technique of finding an unbiased
estimator of the risk directly can be useful in general. Stein, for example,
suggests that it could be used for choosing estimators that, after the data are observed,
have small estimated risk. More discussion appears in Stein (1981).
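
As a worked illustration of the unbiased risk estimate, consider the James-Stein shrinkage function $g(x) = -(p - 2)x/x'x$. The derivation below is a sketch under the assumptions of Theorem 9.1, with $p > 2$ and $x \neq 0$; it is offered as a check of the formulas rather than as part of Stein's own argument.

```latex
% Unbiased risk estimate for g(x) = -(p-2) x / x'x, assuming p > 2 and x \neq 0.
\begin{align*}
(g(x))'(g(x)) &= \frac{(p-2)^2\, x'x}{(x'x)^2} = \frac{(p-2)^2}{x'x}, \\
\nabla_i g_i(x) &= -(p-2)\left( \frac{1}{x'x} - \frac{2 x_i^2}{(x'x)^2} \right), \\
\sum_{i=1}^p \nabla_i g_i(x) &= -(p-2)\left( \frac{p}{x'x} - \frac{2}{x'x} \right) = -\frac{(p-2)^2}{x'x}, \\
p + (g(x))'(g(x)) + 2 \sum_{i=1}^p \nabla_i g_i(x) &= p - \frac{(p-2)^2}{x'x}.
\end{align*}
```

The estimated risk is below $p$ for every $x \neq 0$, and taking expectations gives $p - (p-2)^2 E_{x|\theta}[1/x'x]$, in agreement with the risk of the James-Stein estimator obtained in Section 9.1.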

9.3.2 Bayes and minimax shrinkage 

There is a solid connection between estimators like $\delta_g$ and Bayes rules. As usual, call
$m(x)$ the marginal density of $x$, that is $m(x) = \int f(x|\theta)\pi(\theta)\, d\theta$. Then it turns out that,
using the gradient representation,

\[
E[\theta|x] = x + \nabla \log m(x).
\]

This is because

\[
\begin{aligned}
\nabla \log m(x) &= \frac{\nabla m(x)}{m(x)} \\
&= \frac{-\int (x - \theta) \exp(-(x - \theta)'(x - \theta)/2)\, \pi(\theta)\, d\theta}{\int \exp(-(x - \theta)'(x - \theta)/2)\, \pi(\theta)\, d\theta} \\
&= -x + \frac{\int \theta \exp(-(x - \theta)'(x - \theta)/2)\, \pi(\theta)\, d\theta}{\int \exp(-(x - \theta)'(x - \theta)/2)\, \pi(\theta)\, d\theta} \\
&= -x + E[\theta|x].
\end{aligned}
\]

Thus setting $g(x) = \nabla \log m(x)$, for some prior $\pi$, is a very promising choice of $g$, as
are, more generally, functions $m$ from $\mathbb{R}^p$ to the positive real line. When we restrict
attention to these, the condition for dominance given in inequality (9.18) becomes,
after a little calculus,

\[
(g(x))'(g(x)) + 2 \sum_{i=1}^p \nabla_i g_i(x) = \frac{4 \nabla^2 \sqrt{m(x)}}{\sqrt{m(x)}} \le 0.
\]

Thus, to produce an estimator that dominates $x$ it is enough to find a function $m$ with
$\nabla^2 \sqrt{m(x)} \le 0$ or $\nabla^2 m(x) \le 0$; the latter are called superharmonic functions.

If we do not require that $m(x)$ is a proper density, we can choose $m(x) = |x|^{-(p-2)}$.
For $p > 2$ this is a superharmonic function and $\nabla \log m(x) = -x(p - 2)/x'x$, so it
yields the James-Stein estimator of Theorem 9.1.

The estimator $\delta(x) = x$ is a minimax rule, but it is not the only minimax rule.
Any rule whose risk is bounded by $p$ will also be minimax, so it is not too difficult to
concoct shrinkage approaches that are minimax. This has been a topic of great interest,
and the literature is vast. Our brief discussion is based on a review by Brandwein
and Strawderman (1990), which is also a perfect entry point if you are interested in
more.

A good starting point is the James-Stein estimator $\delta^{JS}$, which is minimax,
because its risk is bounded by $p$. Can we use the form of $\delta^{JS}$ to motivate a broader
class of minimax estimators? This result is an example:


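Theorem 9.3 If $h$ is a monotone increasing function such that $0 \le h \le 2(p - 2)$,
then the estimator

\[
\delta^M(x) = x \left[ 1 - \frac{h(x'x)}{x'x} \right]
\]

is minimax.

Proof: In this proof we assume that $h(x'x)/x'x$ follows the conditions stated in
Lemma 1. Then, by applying Lemma 1,

\[
\begin{aligned}
E_{x|\theta}\left[ (x - \theta)'x\, \frac{h(x'x)}{x'x} \right]
&= (p - 2)\, E_{x|\theta}\left[ \frac{h(x'x)}{x'x} \right] + 2 E_{x|\theta}\left[ h'(x'x) \right] \\
&\ge (p - 2)\, E_{x|\theta}\left[ \frac{h(x'x)}{x'x} \right].
\end{aligned}
\tag{9.19}
\]

The above inequality follows from the fact that $h$ is positive and increasing, which
implies $h' \ge 0$. Moreover, by assumption,

\[
0 \le h(x'x) \le 2(p - 2),
\]

which implies that for nonnull vectors $x$,

\[
0 \le \frac{h^2(x'x)}{x'x} \le 2(p - 2)\, \frac{h(x'x)}{x'x}.
\]

Therefore,

\[
E_{x|\theta}\left[ \frac{h^2(x'x)}{x'x} \right] \le 2(p - 2)\, E_{x|\theta}\left[ \frac{h(x'x)}{x'x} \right].
\tag{9.20}
\]

By combining inequalities (9.19) and (9.20) above we conclude that the risk of this
estimator is

\[
\begin{aligned}
R(\theta, \delta^M)
&= E_{x|\theta}\left[ \left( x - \theta - x\, \frac{h(x'x)}{x'x} \right)' \left( x - \theta - x\, \frac{h(x'x)}{x'x} \right) \right] \\
&= p + E_{x|\theta}\left[ \frac{h^2(x'x)}{x'x} \right] - 2 E_{x|\theta}\left[ (x - \theta)'x\, \frac{h(x'x)}{x'x} \right] \\
&\le p + \left[ 2(p - 2) - 2(p - 2) \right] E_{x|\theta}\left[ \frac{h(x'x)}{x'x} \right] \\
&= p.
\end{aligned}
\]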

In Section 7.7 we proved that x is minimax and it has constant risk p. Since the 
risk of o is at most p, it is minimax. □ 
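
A quick simulation can make the bound concrete. The sketch below is illustrative only: it assumes the constant choice h = p - 2 (the James-Stein case, which is nondecreasing and lies between 0 and 2(p - 2)), it uses only the Python standard library, and the function names are ours rather than anything from the text. It estimates the risk of the shrinkage estimator by Monte Carlo at a chosen mean vector, to be compared with the constant risk p of the estimator x.

```python
import random


def shrinkage_estimate(x, h):
    """Return x * (1 - h(x'x) / x'x) for a list x and a function h of x'x."""
    r = sum(v * v for v in x)
    factor = 1.0 - h(r) / r
    return [factor * v for v in x]


def monte_carlo_risk(theta, h, n_draws=20000, seed=0):
    """Average squared-error loss of the shrinkage estimator at a fixed theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # One draw of x ~ N(theta, I).
        x = [rng.gauss(t, 1.0) for t in theta]
        d = shrinkage_estimate(x, h)
        total += sum((di - ti) ** 2 for di, ti in zip(d, theta))
    return total / n_draws


if __name__ == "__main__":
    p = 10
    james_stein_h = lambda r: p - 2.0  # assumed constant h (James-Stein case)
    for scale in (0.0, 1.0, 3.0):
        theta = [scale] * p
        print(scale, monte_carlo_risk(theta, james_stein_h))
```

At the origin the exact James-Stein risk is p - (p - 2)^2 E[1/x'x] = 2, because x'x is then chi-square with p degrees of freedom and E[1/x'x] = 1/(p - 2). The simulated value at the origin should therefore be close to 2, and at any other mean vector it should stay below p up to simulation error.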
The above theorem gives us a class of shrinkage estimators. The Bayes and generalized
Bayes estimators may be found in this class. Consider a hierarchical model
defined as follows: conditional on $\lambda$, $\theta$ is distributed as $N(0, [(1-\lambda)/\lambda]\,I)$. Also,
$\lambda \sim (1-b)\lambda^{-b}$ for $0 \le b < 1$. The Bayes estimator is the posterior mean of $\theta$
given by

\[
\delta^*(x) = E[\theta|x] = E[E[\theta|x,\lambda]\,|\,x]
= E\left[\left(1 - \frac{1}{1 + [(1-\lambda)/\lambda]}\right) x \,\Big|\, x\right]
= [1 - E[\lambda|x]]\,x.
\tag{9.21}
\]

The next theorem gives conditions under which the Bayes estimator given by (9.21)
is minimax.


Theorem 9.4

1. For $p \ge 5$, the proper Bayes estimator $\delta^*$ from (9.21) is minimax
as long as $b \ge (6-p)/2$.

2. For $p \ge 3$, the generalized Bayes estimator $\delta^*$ from (9.21) is minimax if
$(6-p)/2 \le b < (p+2)/2$.

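Proof: One can show that the posterior mean of $\lambda$ is

\[
E[\lambda|x] = \frac{1}{x'x}\left[p + 2 - 2b -
\frac{2\exp(-(x'x)/2)}{\int_0^1 \lambda^{(1/2)p - b}\exp(-(\lambda/2)x'x)\,d\lambda}\right]
= \frac{h(x'x)}{x'x}
\tag{9.22}
\]

where $h(x'x)$ is the term in square brackets on the right side of expression (9.22).

Observe that $h(x'x) \le p + 2 - 2b$. Moreover, it is monotone increasing because
$\exp((x'x)/2)\int_0^1 \lambda^{(1/2)p - b}\exp(-(\lambda/2)x'x)\,d\lambda
= \int_0^1 \lambda^{(1/2)p - b}\exp(((1-\lambda)/2)x'x)\,d\lambda$ is increasing in $x'x$.
Application of Theorem 9.3 completes the proof. □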


We saw earlier that the Bayes rules, under very general conditions, can be written
as $x + \nabla \log m(x)$, and that a sufficient condition for these rules to dominate $\delta$
is that m be superharmonic. It is therefore no surprise that we have the following
theorem.


Theorem 9.5 If $\pi(\theta)$ is superharmonic, then $x + \nabla \log m(x)$ is minimax.

This theorem provides a broader intersection of the Bayes and minimax shrinkage
approaches.
9.4 Shrinkage with different likelihood and losses

A Stein effect can also be observed when considering different sampling distributions
as well as different loss functions. Brown (1975) discusses inadmissibility of the
mean under whole families of loss functions.

On the likelihood side an important generalization is to the case of unknown
variance. Stein (1981) discusses an extension of the results presented in this chapter
using the unbiased estimation of risk technique. More broadly, spherically symmetric
distributions are distributions with density $f((x-\theta)'(x-\theta))$ in $\Re^p$. When $p > 2$,
$\delta(x) = x$ is inadmissible when estimating $\theta$ with any of these distributions. Formally
we have:

Theorem 9.6 Let $z = (x, y)$ in $\Re^p$, with distribution

\[ z \sim f((x-\theta)'(x-\theta) + y'y), \]

and $x \in \Re^q$, $y \in \Re^{p-q}$. The estimator

\[ \delta_h(z) = (1 - h(x'x, y'y))\,x \]

dominates $\delta(x) = x$ under quadratic loss if there exist $\alpha, \beta > 0$ such that:

(i) $t^{\alpha} h(t, u)$ is a nondecreasing function of t for every u;

(ii) $u^{-\beta} h(t, u)$ is a nonincreasing function of u for every t; and

(iii) $0 \le (t/u)\,h(t, u) \le 2(q-2)\alpha/(p - q - 2 + 4\beta)$.

For details on the proof see Robert (1994, pp. 67-68). Moreover, the Stein effect is
robust in the class of spherically symmetric distributions with finite quadratic risk
since the conditions on h do not depend on f.

In discrete data problems, there are, however, significant exceptions. For example,
Alam (1979) and Brown (1981) show that the maximum likelihood estimator
is admissible for estimating several binomial parameters under squared error loss. 
Also, the MLE is admissible for a vector of multinomial probabilities and a variety 
of other discrete problems. 


9.5 Exercises 

Problem 9.1 Consider a sample

\[ x = (x_1, 0, 6, 5, 9, 0, 13, 0, 26, 1, 3, 0, 0, 4, 0, 34, 21, 14, 1, 9)' \]

where each $x_i \sim \mathrm{Poi}(\lambda_i)$. Let $\delta^{EB}$ denote the 20-dimensional empirical Bayes decision
rule for estimating the vector $\lambda$ under squared error loss and let $\delta_1^{EB}(x)$ be the first
coordinate, that is the estimate of $\lambda_1$.

Write a computer program to plot $\delta_1^{EB}(x)$ versus $x_1$, assuming the following prior
distributions:

1. $\lambda_i \sim \mathrm{Exp}(1)$, $i = 1, \ldots, 20$

2. $\lambda_i \sim \mathrm{Exp}(\gamma)$, $i = 1, \ldots, 20$, $\gamma > 0$

3. $\lambda_i \sim \mathrm{Exp}(\gamma)$, $i = 1, \ldots, 20$, $\gamma$ either 1 or 10

4. $\lambda_i \sim (1-\alpha) I_0 + \alpha\,\mathrm{Exp}(\gamma)$, $\alpha \in (0,1)$, $\gamma > 0$,

where $\mathrm{Exp}(\gamma)$ is an exponential with mean $1/\gamma$.

Write a computer program to graph the risk functions corresponding to each of
the choices above, as you vary $\lambda_1$ and fix the other coordinates $\lambda_i = x_i$, $i \ge 2$.


Problem 9.2 If x is a p-dimensional normal with mean $\theta$ and identity covariance
matrix, and g is a continuous piecewise continuously differentiable function from
$\Re^p$ to $\Re^p$ satisfying

\[ \lim_{|x| \to \infty} \frac{\partial g_i(x)}{\partial x_i}\,\exp(-0.25\,x'x) = 0, \]

then

\[ E_{x|\theta}[(x + g(x) - \theta)'(x + g(x) - \theta)] = p + E_{x|\theta}[g(x)'g(x) + 2\nabla' g(x)], \]

where $\nabla$ is the vector differential operator with coordinates

\[ \nabla_i = \frac{\partial}{\partial x_i}. \]

You can take this result for granted.
Consider the estimator

\[ \delta(x) = x + g(x) \]

where

\[
g_i(x) =
\begin{cases}
-\dfrac{k-2}{x_\wedge' x_\wedge}\,x_i & \text{if } |x_i| \le z_{(k)} \\[2ex]
-\dfrac{k-2}{x_\wedge' x_\wedge}\,z_{(k)}\,\mathrm{sign}(x_i) & \text{if } |x_i| > z_{(k)},
\end{cases}
\]

where z is the vector of order statistics of x, and

\[ x_\wedge' x_\wedge = \sum_j \min(x_j^2, z_{(k)}^2). \]

This estimator is discussed in Stein (1981).
Questions:

1. Why would anyone ever want to use this estimator?

2. Using the result above, show that the risk is

\[ p - (k-2)^2\,E_{x|\theta}\left[\frac{1}{x_\wedge' x_\wedge}\right]. \]

3. Find a vector $\theta$ for which this estimator has lower risk than the James-Stein
estimator.

Problem 9.3 For p = 10, graph the risk of the two empirical Bayes estimators of
Section 9.2.2 as a function of $\theta'\theta$.

Problem 9.4 Prove equation (9.4) using the assumptions of Theorem 9.1.

Problem 9.5 Construct an estimator $\delta$ with the following two properties:

1. $\delta$ is the limit of a sequence of Bayes rules as a hyperparameter gets closer to
its extreme.

2. $\delta$ is not admissible.

You do not need to look very far. 





10 

Scoring rules 


In Chapter 2 we studied coherence and explored relations between “acting rationally” 
and using probability to measure uncertainty about unknown events. De Finetti’s 
“Dutch Book” theorem guaranteed that, if one wants to avoid a sure loss (that is, 
be coherent), then probability calculus ought to be used to represent uncertainty. 
The conclusion was that, regardless of one’s beliefs, it is incoherent not to express 
such beliefs in the form of some probability distribution. In our discussion, the
relationship between the agent’s own knowledge and expertise, empirical evidence, and
the probability distribution used to set fair betting odds was left unexplored. In this 
chapter we will focus on two related questions. The first is how to guarantee that 
probability assessors reveal their knowledge and expertise about unknowns in their 
announced probabilities. The second is how to evaluate, after events have occurred, 
whether their announced probabilities are “good.” 

To make our discussion concrete, we will focus on a simple situation, in which 
assessors have a clear interest in the quality of their announced probabilities. The 
prototypical example is forecasting. We will talk about weather forecasting, partly 
because it is intuitive, and partly because meteorologists were among the first to 
realize the importance of this problem. Weather forecasting is a good example to 
illustrate these concepts, but clearly not the only application of these ideas. In fact, 
any time you are trying to empirically compare different prediction algorithms or 
scientific theories you are in a situation similar to this. 

In a landmark paper, Brier described the issue as follows:

Verification of weather forecasts has been a controversial subject for 
more than a half century. There are a number of reasons why this problem
has been so perplexing to meteorologists and others but one of the
most important difficulties seems to be in reaching an agreement on
the specification of a scale of goodness of weather forecasts. Numerous
systems have been proposed but one of the greatest arguments raised 
against forecast verification is that forecasts which may be the “best” 
according to the accepted system of arbitrary scores may not be the most 
useful forecasts. In attempting to resolve this difficulty the forecaster 
may often find himself in the position of choosing to ignore the verification
system or to let it do the forecasting for him by “hedging” or
“playing the system”. This may lead the forecaster to forecast something 
other than what he thinks will occur, for it is often easier to analyze the 
effect of different possible forecasts on the verification score than it is 
to analyze the weather situation. It is generally agreed that this state of 
affairs is unsatisfactory, as one essential criterion for satisfactory verification
is that the verification scheme should influence the forecaster in
no undesirable way. (Brier 1950, p. 1) 

We will study these questions formally by considering the measures used for the 
evaluation of forecasters. We will talk about scoring rules. In relation to our
discussion thus far, these can be thought of as loss functions for the decision problem of
choosing a probability distribution. The process by which past data are used in reaching
a prediction is not formally modeled, in contrast to previous chapters whose foci
were decision functions. However, data will come into play in the evaluation of the
forecasts. 

Featured articles: 

Brier, G. (1950). Verification of forecasts expressed in terms of probability. Monthly 
Weather Review 78: 1-3. 

Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, 
Series B, Methodological 14: 107-114. 

In the statistical literature, the earliest example we know of a scoring rule is 
Good (1952). Winkler (1969) discusses differences between using scoring rules for 
assessing probability forecasts and using them for elicitation. A general discussion 
of the role of scoring rules in elicitation can be found in Savage (1971) and Lindley
(1982b). Additional references and discussion are in Bernardo and Smith (1994,
Sec. 2.7). 


10.1 Betting and forecasting 

We begin with two simple illustrations in which announcing one’s own personal 
probabilities will be the best thing to do for the forecaster. 

The first example brings us back to the setting of Chapter 2. Suppose you are 
a bookmaker posting odds $q : (1-q)$ on the occurrence of an event $\theta$. Say your
personal beliefs about $\theta$ are represented by $\pi$. Both $\pi$ and q are points in [0,1]. As
the bookmaker, you get $-(1-q)S$ if $\theta$ occurs, and $qS$ if $\theta$ does not occur, where, as
before, S is the stake and it can be positive or negative. Your expected gain, which in
this case could also be positive or negative, is

\[ -(1-q)S\pi + qS(1-\pi) = S(q - \pi). \]

Suppose you have to play by de Finetti’s rules and post odds at which you are indifferent
between taking bets with positive or negative stakes on $\theta$. In order for you to
be indifferent you need to set $q = \pi$. So in the betting game studied by de Finetti,
a bookmaker interested in optimizing expected utility will choose betting odds that
correspond to his or her personal probability. On the other hand, someone betting
with you and holding beliefs $\pi'$ about $\theta$ will have an expected gain of $S(\pi' - q)$ and
will have an incentive to bet a positive stake on $\theta$ whenever $\pi' > q$.

The second example is a cartoon version of the problem of ranking forecasters. 
Commencement day is two days away and a departmental ceremony will be held 
outdoors. As one of the organizers you decide to seek the opinion of an expert about 
the possibility of rain. You can choose among two experts and you want to do so 
based on how well they predict whether it will rain tomorrow. Then you will decide 
on one of the experts and ignore the other, at least as far as the commencement day 
prediction goes. You are going to come up with a rule to rank them and you are going 
to tell them ahead of time which rule you will use. You are worried that if you do 
not specify the rule well enough, they may report to you something that is not their 
own best prediction, in an attempt to game the system—the same concern that was 
mentioned by Brier earlier. 

So this is what you come up with. Let the event “rain tomorrow” be represented
here by $\theta$, and suppose that q is a forecaster’s probability of rain for tomorrow. If $\theta$
occurs, you would like q to be close to 1; if $\theta$ does not occur you would like q to
be close to 0. So you decide to assign to each forecaster the score $(q - \theta)^2$, and then
choose the forecaster with the lowest score.

With this scoring rule, is it in a forecaster’s best interest to announce his or her
own personal beliefs about $\theta$ in playing this game? Let us call a forecaster’s own
probability of rain tomorrow $\pi$, and compute the expected score, which is

\[ (q-1)^2\pi + q^2(1-\pi) = (q-\pi)^2 + \pi(1-\pi). \]

To make this as small as possible a forecaster must choose $q = \pi$.
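
The minimization can also be checked numerically. The following is a minimal sketch, assuming a simple grid search over announcements in [0, 1]; the function names are ours and chosen only for illustration.

```python
def expected_quadratic_score(q, pi):
    """Expected value of (q - theta)^2 when theta = 1 with probability pi."""
    return pi * (q - 1.0) ** 2 + (1.0 - pi) * q ** 2


def best_announcement(pi, n_grid=1001):
    """Grid search for the announcement q in [0, 1] with the smallest expected score."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    return min(grid, key=lambda q: expected_quadratic_score(q, pi))


if __name__ == "__main__":
    for pi in (0.1, 0.5, 0.8):
        # The minimizer on the grid should coincide with the forecaster's belief pi.
        print(pi, best_announcement(pi))
```

Because the expected score equals (q - pi)^2 plus a term that does not involve q, the grid minimizer lands on the forecaster's own probability whenever that probability is a grid point.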

Incidentally, choosing one forecaster only and ignoring the other may not necessarily
be the best approach from your point of view: there may be ways of combining
the two to improve on both forecasts. A discussion of that would be too long a detour, 
but you can look at Genest and Zidek (1986) for a review. 

10.2 Scoring rules 

10.2.1 Definition 

We are now ready to make these considerations more formal. Think of a situation 
where a forecaster needs to announce the probability of a certain event. To keep
things from getting too technically complex we will assume that this event has a
finite number J of possible mutually exclusive outcomes, represented by the vector
$\theta = (\theta_1, \ldots, \theta_J)$ of outcome indicators. When we say that $\theta_j$ occurred we mean that
$\theta_j = 1$ and therefore $\theta_i = 0$ for $i \ne j$. The forecaster needs to announce probabilities
for each of the J possibilities. The vector $q = (q_1, \ldots, q_J)$ contains the announced
probabilities for the corresponding indicator. Typically the action space for the choice
of q will be the set Q of all probability distributions in the $J-1$ dimensional simplex.
A scoring rule is a measure of the quality of the forecast q against the observed event
outcome $\theta$. Specifically:

Definition 10.1 (Scoring rule) A scoring rule s for the probability distribution q
is a function assigning a real number $s(\theta, q)$ to each combination $(\theta, q)$.

We will often use the shorthand notation $s(\theta_j, q)$ to indicate $s(\theta, q)$ evaluated
at $\theta_j = 1$ and $\theta_i = 0$ for $i \ne j$. This does not mean that the scoring rule
only depends on one of the events in the partition. Note that after $\theta_j$ is observed,
the score is computed based potentially on the entire vector of forecasts, not just
the probability of $\theta_j$. In the terminology of earlier lectures, $s(\theta_j, q)$ represents the
loss when choosing q as the announced probability distribution, and $\theta_j$ turns out
to be the true state of the world. We will say that a scoring rule is smooth if it
is continuously differentiable in each $q_j$. This requires that small variations in the
announced probabilities would produce only small variations in the score, which is
often reasonable.

Because $\theta$ is unknown when the forecast is made, so is the score. We assume
that the forecaster is a utility maximizer, and will choose q to minimize the
expected score. This will be done based on his or her coherent personal probabilities,
denoted by $\pi = (\pi_1, \ldots, \pi_J)$. The expected loss from the point of view of the
forecaster is

\[ S(q) = \sum_{j=1}^{J} \pi_j\,s(\theta_j, q). \tag{10.1} \]

Note the different role played by the announced probabilities q that enter the score
function, and the forecaster’s beliefs $\pi$ that are weights in the computation of the
expected score.

The fact that the forecaster seeks a Bayesian solution to this problem does not 
necessarily mean that the distribution q is based on Bayesian inference, only that the 
forecaster acts as a Bayesian in deciding which q will be best to report. 

10.2.2 Proper scoring rules 

In both of our examples in Section 10.1, the scoring rule was such that the decision
maker’s expected score was minimized by $\pi$, the personal beliefs. Whenever
this happens, no matter what the beliefs are, the scoring rule is called proper. More
formally:

Definition 10.2 (Proper scoring rule) A scoring rule s is proper if and only if, for
each strictly positive $\pi$,

\[ \inf_{q \in Q} S(q) = S(\pi). \tag{10.2} \]

If $\pi$ is the only solution that minimizes the expected score, then the scoring rule is
called strictly proper.

Winkler observes that this property is useful for both probability assessment and 
probability evaluation: 

In an ex ante sense, strictly proper scoring rules provide an incentive 
for careful and honest forecasting by the forecaster or forecast system. 

In an ex post sense, they reward accurate forecasts and penalize inferior 
forecasts. (Winkler 1969, p. 2) 

Proper scoring rules can also be used as devices for eliciting one’s subjective 
probability. Elicitation of expectations is studied by Savage (1971) who mathematically
developed general forms for the scoring rules that are appropriate for eliciting
one’s expectation. 

Other proper scoring rules are the logarithmic (Good 1952)

\[ s(\theta_j, q) = -\log q_j \]

and the spherical rule (Savage 1971)

\[ s(\theta_j, q) = -\frac{q_j}{\left(\sum_i q_i^2\right)^{1/2}}. \]

10.2.3 The quadratic scoring rules 

Perhaps the best known proper scoring rule is the quadratic rule, already encountered 
in Section 10.1. The quadratic rule was introduced by Brier (1950) to provide a
“verification score” for weather forecasts. A mathematical derivation of the quadratic
scoring rule can be found in Savage (1971). A quadratic scoring rule is a function of
the form

\[ s(\theta_j, q) = A \sum_{i=1}^{J} (q_i - I_{i=j})^2 + B_j, \tag{10.3} \]

with $A > 0$. Here $I_{i=j}$ is 1 if $i = j$ and 0 otherwise.

Take a minute to convince yourself that the rule of the commencement party
example of Section 10.1 is a special case of this, occurring when J = {1,2}. Find the
values of A and $B_j$ implied by that example, and sort out why in the example there is
only one term to s while here there are J.
Now we are going to show that the quadratic scoring rule is proper; that is, that
$q = \pi$ is the Bayes decision. To do so, we have to minimize, with respect to q, the
expected score

\[ S(q) = \sum_{j=1}^{J} \pi_j \left[ A \sum_{i=1}^{J} (q_i - I_{i=j})^2 + B_j \right] \]

subject to the constraint that the elements in q sum to one. The Lagrangian for this
constrained minimization is

\[ L(q, \lambda) = S(q) + \lambda \left( \sum_{j=1}^{J} q_j - 1 \right). \]

Taking partial derivatives with respect to $q_k$ we have

\[ \frac{\partial L}{\partial q_k} = 2A \sum_{j=1}^{J} \pi_j (q_k - I_{k=j}) + \lambda = 2A(q_k - \pi_k) + \lambda. \]

Setting $q_k = \pi_k$ and $\lambda = 0$ satisfies the first-order conditions and the constraint, since
$\sum_j \pi_j = 1$. We should also verify that this gives a minimum. This is not difficult, but
not insightful either, and we will not go over it here.

10.2.4 Scoring rules that are not proper 

It is not difficult to construct scoring rules that look reasonable but are not proper. 
For example, go back to the commencement example, and say you state the score
$|q - \theta|$. From the forecaster’s point of view, the expected score is

\[ (1-q)\pi + q(1-\pi) = \pi + (1-2\pi)q. \]

This is linear in q and is minimized by announcing $q = 1$ if $\pi > 0.5$ and $q = 0$ if
$\pi < 0.5$. If $\pi$ is exactly 0.5 then the expected score is flat and any q will do. You get
very little information out of the forecaster with this scoring system. Problem 10.1
gives you a chance to work out the details in the more general cases.
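
The corner solutions are easy to see numerically. This is a small sketch assuming a grid search over [0, 1]; the function names are ours.

```python
def expected_absolute_score(q, pi):
    """Expected value of |q - theta| when theta = 1 with probability pi."""
    return pi * (1.0 - q) + (1.0 - pi) * q


def best_announcement(pi, n_grid=101):
    """Grid search for the announcement with the smallest expected absolute score."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    return min(grid, key=lambda q: expected_absolute_score(q, pi))


if __name__ == "__main__":
    for pi in (0.3, 0.51, 0.7):
        # Beliefs above one half push the announcement to 1, beliefs below to 0.
        print(pi, best_announcement(pi))
```

Beliefs of 0.51 and 0.7 lead to the same announcement, which is why this score reveals only on which side of one half the forecaster's probability lies.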

Another temptation you should probably resist if you want to extract honest 
beliefs is that of including in the score the consequences of the forecaster’s error 
on your own decision making. Say a company asks a geologist to forecast whether a
large amount of oil is available ($\theta_1$) or not ($\theta_2$) in the region. Say q is the announced
probability. After drilling, the geologist’s forecast is evaluated by the company
according to the following (modified quadratic) scoring rule:

\[ s(\theta_1, q) = (q_1 - 1)^2 \]
\[ s(\theta_2, q) = 10(q_2 - 1)^2; \]

that is, the penalty for an error when the event is false is 10 times the penalty when
the event is true. It may very well be that the losses to the company are different in
the two cases, but that should affect the use the company makes of the probability
obtained by the geologist, not the way the geologist’s accuracy is rewarded. To see
that, note that the geologist will choose q that minimizes the expected score. Because
$q_2 = 1 - q_1$,

\[ S(q) = (q_1 - 1)^2\pi + 10 q_1^2 (1 - \pi). \]

The minimum is attained when $q_1 = \pi/(10 - 9\pi)$. Clearly, this rule is not proper.
The announced probability is always smaller than the geologist’s own probability $\pi$
when $\pi \in (0,1)$.
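
The closed-form minimizer can be compared with a brute-force search. The sketch below assumes the expected score written in terms of q_1 alone, as above; the function names and the grid resolution are illustrative choices of ours.

```python
def expected_geologist_score(q1, pi):
    """Expected modified quadratic score, using q2 = 1 - q1."""
    return pi * (q1 - 1.0) ** 2 + 10.0 * (1.0 - pi) * q1 ** 2


def optimal_announcement(pi):
    """Closed-form minimizer pi / (10 - 9 pi) of the expected score."""
    return pi / (10.0 - 9.0 * pi)


def grid_minimizer(pi, n_grid=10001):
    """Brute-force minimizer of the expected score over a grid on [0, 1]."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    return min(grid, key=lambda q1: expected_geologist_score(q1, pi))


if __name__ == "__main__":
    for pi in (0.2, 0.5, 0.9):
        print(pi, optimal_announcement(pi), grid_minimizer(pi))
```

For every belief strictly between 0 and 1 the announcement is shaded downward, reflecting the heavier penalty attached to the case in which no oil is found.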

One may observe at this point that as long as the announced $q_1$ is an invertible
function of $\pi$ the company is no worse off with the scoring rule above than it would
be with a proper rule, as long as the forecasts are not taken at face value. This is
somewhat general. Lindley (1982b) considers the case where the forecaster is scored
based on the function $s(\theta_1, q)$ if $\theta_1$ occurs, and $s(\theta_2, q)$ if $\theta_2$ occurs. So the forecaster
should minimize his or her expected score

\[ S(q) = \pi s(\theta_1, q) + (1 - \pi) s(\theta_2, q). \]

If the utility function has a first derivative, the forecaster should obtain q as a solution
of the equation $dS/dq_1 = 0$ under the constraint $q_2 = 1 - q_1$. This implies
$\pi s'(\theta_1, q) + (1 - \pi) s'(\theta_2, q) = 0$. Solving, we obtain

\[ \pi(q) = \frac{s'(\theta_2, q)}{s'(\theta_2, q) - s'(\theta_1, q)}, \tag{10.5} \]

which means that the probability $\pi$ can be recovered via a transformation of the
stated value q. Can any value q lead to a probability? Lindley studied this problem
and verified that if q solves the first-order equation above, then its transformation
obeys the laws of probability. So while a proper scoring rule guarantees that the
announced probabilities are the forecaster’s own, other scoring rules may achieve the
goal of extracting sufficient information to reconstruct the forecaster’s probabilities.
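
As an illustration of this recovery, the sketch below applies the transformation in (10.5) to the geologist's rule of Section 10.2.4. It assumes the scores are written as functions of q_1 alone, namely (q_1 - 1)^2 if oil is found and 10 q_1^2 otherwise, and that the geologist announces the minimizer pi/(10 - 9 pi); the function names are ours.

```python
def announced_probability(pi):
    """Announcement that minimizes the geologist's expected score."""
    return pi / (10.0 - 9.0 * pi)


def recovered_probability(q1):
    """Apply (10.5): s'(theta_2, q) / (s'(theta_2, q) - s'(theta_1, q))."""
    ds1 = 2.0 * (q1 - 1.0)  # derivative of (q1 - 1)^2 with respect to q1
    ds2 = 20.0 * q1         # derivative of 10 * q1^2 with respect to q1
    return ds2 / (ds2 - ds1)


if __name__ == "__main__":
    for pi in (0.05, 0.5, 0.95):
        q1 = announced_probability(pi)
        # The recovered value should reproduce the original belief pi.
        print(pi, q1, recovered_probability(q1))
```

In this example the transformation simplifies to 10 q_1 / (1 + 9 q_1), which is the inverse of the announcement rule.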


10.3 Local scoring rules 

A further restriction that one may impose on a scoring rule is the following. When
event $\theta_j$ occurs, the forecaster is scored only on the basis of what was announced
about $\theta_j$, and not on the basis of what was announced for the events that did not
occur. Such a scoring rule is called local. Formally we have the following.

Definition 10.3 (Local scoring rule) A scoring rule s is local if there exist
functions $s_j(\cdot)$, $j \in J$, such that $s(\theta_j, q) = s_j(q_j)$.

For example, the logarithmic scoring rule $s(\theta_j, q) = -\log q_j$ is local. The lower
the probability assigned to the observed event, the higher the score. The rest of the
announced probabilities are not taken into consideration. The quadratic scoring rule
is not local. Say J = 3, A = 1, and $B_j = 0$. Suppose we observe $\theta_1 = 1$. The vectors
$q = (0.5, 0.25, 0.25)$ and $q' = (0.5, 0.5, 0)$ both assigned a probability of 0.5 to the
event $\theta_1$, but $q'$ gets a score of 0.5 while q gets a score of 0.375. This is because the
quadratic rule penalizes a single error of size 0.5 more than it penalizes two errors
of size 0.25. It can be debated whether this is appropriate or not. If you are tempted
to replace the square in the quadratic rule with an absolute value, remember that it is
not proper (see Problem 10.1). Another factor to consider in thinking about whether
a local scoring rule is appropriate for a specific problem has to do with whether there
is a natural ordering or a notion of closeness for the elements of the partition. A local
rule gives no “partial credit” for near misses. If one is predicting the eye color of a
newborn, who turns out to have green eyes, one may assess the prediction based only
on the probability of green, and not worry about how the complement was distributed
among, say, brown and blue. Things start getting trickier if we consider three events
like “fair weather,” “rain,” and “snow” on a winter’s day. Another example where a
local rule may not be attractive is when the events in the partitions are the outcomes
of a quantitative random variable, such as a count. Then the issue of near misses may
be critical.
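
The numerical example in this discussion can be reproduced in a few lines. The sketch assumes A = 1 and B_j = 0 in the quadratic rule and indexes the three outcomes from 0, so the observed first event is index 0; the function names are ours.

```python
import math


def quadratic_score(j, q, A=1.0, B=0.0):
    """Quadratic score A * sum_i (q_i - I(i == j))^2 + B when outcome j occurs."""
    return A * sum((q_i - (1.0 if i == j else 0.0)) ** 2 for i, q_i in enumerate(q)) + B


def log_score(j, q):
    """Logarithmic score: depends only on the probability given to outcome j."""
    return -math.log(q[j])


q = (0.5, 0.25, 0.25)
q_prime = (0.5, 0.5, 0.0)

if __name__ == "__main__":
    # Quadratic scores differ even though both forecasts give 0.5 to the observed event.
    print(quadratic_score(0, q), quadratic_score(0, q_prime))
    # Logarithmic scores coincide, as a local rule requires.
    print(log_score(0, q), log_score(0, q_prime))
```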

Naturally, having a local scoring rule gives us many fewer things to worry about 
and makes proving theorems easier. For example, the local property leads to a nice 
functional characterization of the scoring rule. Specifically, every smooth, proper,
and local scoring rule can be written as $B_j + A \log q_j$ for $A < 0$. This means that if we
are interested in choosing a scoring rule for assessing probability forecasters, and we 
find that smooth, proper, and local are reasonable requirements, we can restrict our 
choice to logarithmic scoring rules. This is usually considered a very strong point 
in favor of logarithmic scoring rules. We will come back to it later when discussing 
how to measure the information provided by an experiment. 

Theorem 10.1 (Proper local scoring rules) If s is a smooth, proper, and local
scoring rule for probability distributions q defined over $\Theta$, then it must be of the
form $s(\theta_j, q) = B_j + A \log q_j$, where $A < 0$ and the $B_j$ are arbitrary constants.

Proof: Assume that $\pi_j > 0$ for all j. Using the fact that s is local we can write

\[ \inf_q S(q) = \inf_q \sum_j s(\theta_j, q)\,\pi_j = \inf_q \sum_j s_j(q_j)\,\pi_j. \]

Because s is smooth, S, which is a continuous function of s, will also be smooth.
Therefore, in order for $\pi$ to be a minimum, it has to satisfy the first-order conditions
for the Lagrangian:

\[ L(q, \lambda) = \sum_j s_j(q_j)\,\pi_j + \lambda \left( \sum_j q_j - 1 \right). \]

Differentiating,

\[ \frac{\partial L}{\partial q_j} = \pi_j\,s_j'(q_j) + \lambda = 0, \qquad j \in J. \tag{10.6} \]

Because s is proper, the minimum expected score is achieved when $q = \pi$. So, if $\pi$
is a minimum, it must be that

\[ s_j'(\pi_j) = -\frac{\lambda}{\pi_j}, \qquad j \in J. \tag{10.7} \]

Integrating both sides of (10.7), we get that each $s_j$ must be of the form
$s_j(\pi_j) = -\lambda \log \pi_j + B_j$. In order to guarantee that the extremal found is a
minimum we must have $\lambda > 0$. □


The logarithmic scoring rule goes back to Good, who pointed out: 

A reasonable fee to pay to an expert who has estimated a probability 
as q is $A \log(2q)$ if the event occurs and $A \log(2 - 2q)$ if the event does
not occur. If $q > 1/2$ the latter payment is really a fine. . . . This fee
can easily be seen to have the desirable property that its expectation is
minimized if $q = \pi$, the true probability, so that it is in the expert’s own
interest to give an objective estimate. (Good 1952, p. 112 with notational 
changes) 

We changed Good’s notation to match ours. It is interesting to see how Good 
specifies the constants $B_j$ to set up a reward structure that could correspond to both a
win and a loss, and what is his justification for this. It is also remarkable how Good 
uses the word “objective” for “true to one’s own knowledge and beliefs.” The reason 
for this is in what follows: 

It is also in his interest to collect as much evidence as possible. Note that 
no fee is paid if q = 1/2. The justification of this is that if a larger fee
was paid the expert would have a positive expected gain by saying that 
q = 1/2 without looking at the evidence at all. (Good 1952, p. 112 with 
notational changes) 

Imposing the condition that a scoring rule is local is not the only way to narrow 
the set of candidate proper scoring rules one may consider. Savage (1971) shows that
the quadratic loss is the only proper scoring rule that satisfies the following conditions:
first, the expected score S must be symmetric in $\pi$ and q, so that reversing
the role of beliefs and announced probabilities does not change the expected score.
Second, S must depend on $\pi$ and q only through their difference. The second condition
highlights one of the rigidities of squared error loss: the pairs (0.5, 0.6) and
$(10^{-12}, 0.1 + 10^{-12})$ are considered equally, though in practice the two discrepancies
may have very different implications.

We have not considered the effect of a scoring rule on how one collects and uses 
evidence, but it is an important topic, and we will return to this in Chapter 13, when 
exploring specific ways of measuring the information in a data set. 
10.4 Calibration and refinement

10.4.1 The well-calibrated forecaster 

In this section we discuss calibration and refinement, and their relation to scoring 
rules. Our discussion follows DeGroot and Fienberg (1982) and Seidenfeld (1985). 

A set of probabilistic forecasts $\pi$ is well calibrated if $100\pi\%$ of all predictions
reported at probability $\pi$ are true. In other words, calibration looks at the agreement
between relative frequency for the occurrence of an event (rain in the above
context) and forecasts. Here we are using $\pi$ for forecasts, and assume that $\pi = q$. To
illustrate calibration, consider Table 10.1, adapted from Brier (1950). A forecaster is
calibrated if the two columns in the table are close.

Requiring calibration in a forecaster is generally a reasonable idea, but calibration 
is not sufficient for the forecasts to be useful. DeGroot and Fienberg point out that: 

In practice, however, there are two reasons why a forecaster may not be 
well calibrated. First, his predictions can be observed for only a finite 
number of days. Second, and more importantly, there is no inherent reason
why his predictions should bear any relation whatsoever to the actual
occurrence of rain. (DeGroot and Fienberg 1983, p. 14) 

If this comment sounds harsh, consider these two points: first, a scoring
rule based on calibration alone would, in a finite horizon, lead to forecasts whose
only purpose is to game the system. Suppose a weather forecaster is scored at the
end of the year based on how well calibrated he or she is. For example, we could
take all days where the announced chance of rain was 10% and see whether or not 
the empirical frequency is close to 10%. Next we would do the same with 20% 
announced chance of rain and so on. The desire to be calibrated over a finite time 
period could induce the forecaster to make announcements that are radically at odds 
with his or her beliefs. Say that, towards the end of the year, the forecaster is finding 
that the empirical frequency for the 10% days is a bit too low—in the table it is 7%. 
If tomorrow promises to bring a flood of biblical proportions, it is to the forecaster’s 
advantage to announce that the chance of rain is 10%. The forecaster’s reputation
may suffer, but the score will improve.


Table 10.1 Verification of a series of 85 forecasts expressed in terms
of the probability of rain. Adapted from Brier (1950).

Forecast probability of rain    Observed proportion of rain cases
0.10                            0.07
0.30                            0.10
0.50                            0.29
0.70                            0.40
0.90                            0.50
Second, even when faced with an infinite sequence of forecasts, a forecaster has
no incentive to attempt to correlate the daily forecasts with the empirical observations.
For example, a forecaster who invariably announces that the probability of rain
is, say, 10%, because that is the likely overall average of the sequence, would likely
be very well calibrated. These forecasts would not, however, be very useful.

So, suppose you have available two well-calibrated forecasters. Whose predictions
would you choose? That is of course why we have scoring rules, but before
we tie this discussion back to scoring rules, we describe a measure of forecasters’
accuracy called refinement. Loosely speaking, refinement looks at the dispersion of
the forecasts.
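
A small computation shows how a calibration check works and why it cannot separate a useful forecaster from an uninformative one. The record below is hypothetical (20 days with rain on 2 of them), and the function name is ours; the sketch simply tabulates, for each announced probability, the observed frequency of rain on the days that probability was announced.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Map each announced probability to the observed frequency of rain on those days."""
    days = defaultdict(int)
    rainy = defaultdict(int)
    for forecast, rained in zip(forecasts, outcomes):
        days[forecast] += 1
        rainy[forecast] += rained
    return {f: rainy[f] / days[f] for f in sorted(days)}


# Hypothetical record: 20 days, rain on 2 of them.
outcomes = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] * 2
constant_forecasts = [0.1] * len(outcomes)          # always announces 10%
sharp_forecasts = [float(y) for y in outcomes]      # announces 1 on rainy days, 0 otherwise

if __name__ == "__main__":
    print(calibration_table(constant_forecasts, outcomes))
    print(calibration_table(sharp_forecasts, outcomes))
```

Both hypothetical forecasters pass the calibration check on this record, yet only the second one says anything about which days will be rainy.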

Consider a series of forecasts of the same type of event over time, as would be the
case if we were to announce the probability of rain on the daily forecast. Consider
a well-calibrated forecaster who announces discrete probabilities, so there is only
a finite number of possibilities, collected in the set $\Pi$; for example, $\Pi$ could be
$\{0, 0.1, 0.2, \ldots, 1\}$. We can single out the days when the forecast is a particular $\pi$
using the sequence of indicators $1_{\{\pi_i = \pi\}}$. Consider a horizon consisting of the
first $n_k$ elements in the sequence. Let $n_k^{\pi}$ denote the number of forecasts equal to $\pi$
within the horizon. The proportion of days in the horizon with forecast $\pi$ is

\[ \nu_k(\pi) = n_k^{\pi} / n_k . \]

Let $x_i$ be the indicator of rain in the $i$th day. Then

\[ \bar{x}_k(\pi) = \frac{1}{n_k^{\pi}} \sum_{i=1}^{n_k} 1_{\{\pi_i = \pi\}}\, x_i \]

is the relative frequency of rain among those days in which the forecaster’s prediction
was $\pi$. The relative frequency of rainy days is

\[ \mu = \sum_{\pi \in \Pi} \bar{x}_k(\pi)\, \nu_k(\pi) . \]

We assume that $0 < \mu < 1$.
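As a small illustration of these definitions, the sketch below tabulates $\nu(\pi)$, $\bar{x}(\pi)$, and $\mu$ from a record of paired forecasts and rain indicators. The function name `forecast_summary` and the ten-day record are hypothetical, chosen only so that the forecaster comes out well calibrated.

```python
from collections import defaultdict

def forecast_summary(forecasts, outcomes):
    """Return (nu, xbar, mu) for paired forecasts and 0/1 rain indicators.

    nu[p]   : proportion of days in the horizon with forecast p
    xbar[p] : relative frequency of rain among days with forecast p
    mu      : overall relative frequency of rainy days
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(forecasts)
    days = defaultdict(int)    # number of days with each forecast
    rainy = defaultdict(int)   # number of rainy days among them
    for p, x in zip(forecasts, outcomes):
        days[p] += 1
        rainy[p] += x
    nu = {p: days[p] / n for p in days}
    xbar = {p: rainy[p] / days[p] for p in days}
    mu = sum(outcomes) / n
    return nu, xbar, mu

# Hypothetical ten-day record: forecasts of 0.2 and 0.6, five days each.
forecasts = [0.2] * 5 + [0.6] * 5
outcomes = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
nu, xbar, mu = forecast_summary(forecasts, outcomes)
well_calibrated = all(abs(xbar[p] - p) < 1e-12 for p in nu)
```

In this record it rains on one of the five 0.2 days and on three of the five 0.6 days, so the observed frequencies match the announced probabilities.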

To simplify notation in what follows we drop the index $k$, but all functions
are defined considering a finite horizon of $k$ days. Before formally introducing
refinement we need the following definition:

Definition 10.4 (Stochastic transformation) A stochastic transformation $h(\pi \mid p)$
is a function defined for all $\pi \in \Pi$ and $p \in \Pi$ such that

\[ h(\pi \mid p) \ge 0, \quad \forall\, \pi, p \in \Pi, \]
\[ \sum_{\pi \in \Pi} h(\pi \mid p) = 1, \quad \forall\, p \in \Pi . \]


Definition 10.5 (Refinement) Consider two well-calibrated forecasters A and B
whose predictions are characterized by $\nu_A$ and $\nu_B$, respectively. Forecaster A is at
least as refined as [or, alternatively, forecaster A is sufficient for] forecaster B if
there exists a stochastic transformation $h$ such that

\[ \sum_{p \in \Pi} h(\pi \mid p)\, \nu_A(p) = \nu_B(\pi), \quad \forall\, \pi \in \Pi, \]
\[ \sum_{p \in \Pi} h(\pi \mid p)\, p\, \nu_A(p) = \pi\, \nu_B(\pi), \quad \forall\, \pi \in \Pi . \]

In other words, if forecaster A makes a prediction $p$, we can generate B’s prediction
frequencies by utilizing the conditional distribution $h(\pi \mid p)$.

Let us go back to the forecaster that always announces the overall average, that is

\[ \nu(\pi) = \begin{cases} 1, & \text{if } \pi = \mu \\ 0, & \text{if } \pi \ne \mu . \end{cases} \]

We need to assume that $\mu \in \Pi$, which is not so realistic if predictions are highly
discrete, but convenient. This forecaster is the least refined forecaster. Alternatively,
we say the forecaster exhibits zero sharpness. Any other forecaster is at least as
refined as the least refined forecaster because we can set up the following stochastic
transformation:

\[ h(\mu \mid p) = 1, \quad \text{for } p \in \Pi, \]
\[ h(\pi \mid p) = 0, \quad \text{for } \pi \ne \mu, \]

and apply Definition 10.5.

At another extreme of the spectrum of well-calibrated forecasters is the forecaster
whose prediction each day is either 0 or 1:

\[ \nu(\pi) = \begin{cases} \mu, & \text{if } \pi = 1 \\ 1 - \mu, & \text{if } \pi = 0 \\ 0, & \text{if } \pi \notin \{0, 1\} . \end{cases} \]

Because this forecaster is well calibrated, his or her predictions have to also be
always correct. This is the most refined forecaster, who is said to have perfect
sharpness. To see this, consider any other forecaster B, and use Definition 10.5 with the
stochastic transformation

\[ h(\pi \mid 1) = \frac{\pi}{\mu}\, \nu_B(\pi), \quad \text{for } \pi \in \Pi, \]
\[ h(\pi \mid 0) = \frac{1 - \pi}{1 - \mu}\, \nu_B(\pi), \quad \text{for } \pi \in \Pi . \]
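The two conditions of Definition 10.5 are easy to check numerically for a given candidate transformation. The sketch below does so in exact rational arithmetic for a perfectly sharp forecaster A and a hypothetical calibrated forecaster B on the grid $\{0, 1/4, 1/2, 3/4, 1\}$ with $\mu = 1/2$. The function names are our own, and the rows of $h$ for forecasts that A never issues are filled with an arbitrary valid choice, since they receive zero weight.

```python
from fractions import Fraction as F

def is_stochastic_transformation(h, support):
    """Definition 10.4: h(pi|p) >= 0 and sum over pi of h(pi|p) = 1 for every p."""
    return all(
        all(h.get((pi, p), 0) >= 0 for pi in support)
        and sum(h.get((pi, p), 0) for pi in support) == 1
        for p in support
    )

def satisfies_refinement(h, nu_a, nu_b, support):
    """Both equalities of Definition 10.5 for one candidate h."""
    for pi in support:
        mass = sum(h.get((pi, p), 0) * nu_a.get(p, 0) for p in support)
        mean = sum(h.get((pi, p), 0) * p * nu_a.get(p, 0) for p in support)
        if mass != nu_b.get(pi, 0) or mean != pi * nu_b.get(pi, 0):
            return False
    return True

support = [F(0), F(1, 4), F(1, 2), F(3, 4), F(1)]
mu = F(1, 2)
nu_a = {F(0): 1 - mu, F(1): mu}                # perfectly sharp forecaster A
nu_b = {F(1, 4): F(1, 2), F(3, 4): F(1, 2)}    # hypothetical calibrated forecaster B

h = {}
for pi in support:
    h[(pi, F(1))] = pi / mu * nu_b.get(pi, 0)              # h(pi | 1)
    h[(pi, F(0))] = (1 - pi) / (1 - mu) * nu_b.get(pi, 0)  # h(pi | 0)
for p in support:
    if p not in (F(0), F(1)):   # forecasts A never issues: any valid row will do
        h[(p, p)] = F(1)
```

Both checks succeed for this pair. Note that `satisfies_refinement` only verifies a given $h$; a failed check does not by itself show that no suitable transformation exists.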

DeGroot and Fienberg (1982, p. 302) provide a result that simplifies the
construction of useful stochastic transformations.

Theorem 10.2 Consider two well-calibrated forecasters A and B. Then A is at least
as refined as B if and only if

\[ \sum_{i=0}^{j-1} [\pi^{(j)} - \pi^{(i)}]\,[\nu_A(\pi^{(i)}) - \nu_B(\pi^{(i)})] \ge 0, \quad \forall\, j = 1, \ldots, J - 1, \]

where $J$ is the number of elements in $\Pi$ and $\pi^{(i)}$ is the $i$th smallest element of $\Pi$,
counting from $i = 0$.

DeGroot and Fienberg (1982) present a more general result that compares forecasters
who are not necessarily well calibrated using the concept of sufficiency as introduced
by Blackwell (1951) and Blackwell (1953).
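The criterion of Theorem 10.2 involves only the two forecast distributions, so it can be evaluated directly. The following is a minimal sketch, assuming the elements of $\Pi$ are indexed from 0 in increasing order and that both forecasters are well calibrated for the same sequence of events; the function name and the three example forecasters are hypothetical.

```python
from fractions import Fraction as F

def at_least_as_refined(nu_a, nu_b, support):
    """Check the inequalities of Theorem 10.2 for calibrated forecasters A and B."""
    pts = sorted(support)   # pts[i] plays the role of the i-th smallest forecast, i = 0, ..., J-1
    for j in range(1, len(pts)):
        total = sum(
            (pts[j] - pts[i]) * (nu_a.get(pts[i], 0) - nu_b.get(pts[i], 0))
            for i in range(j)
        )
        if total < 0:
            return False
    return True

support = [F(k, 4) for k in range(5)]
sharp = {F(0): F(1, 2), F(1): F(1, 2)}           # always announces 0 or 1
middle = {F(1, 4): F(1, 2), F(3, 4): F(1, 2)}    # announces 1/4 or 3/4
constant = {F(1, 2): F(1)}                       # always announces the average 1/2
```

With these inputs the sharp forecaster is at least as refined as the other two, and the constant forecaster is not at least as refined as either of them, in line with the discussion of the two extremes above.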

At this point we are finally ready to explore the relationship between proper
scoring rules and the concepts of calibration and refinement. Let us go back to the
quadratic scoring rule, or Brier score, which in the setting of this section is

\[ BS = \sum_{\pi \in \Pi} \nu(\pi) \left[ \bar{x}(\pi)(\pi - 1)^2 + (1 - \bar{x}(\pi))\,\pi^2 \right] . \]

This score can be rewritten as

\[ BS = \sum_{\pi \in \Pi} \nu(\pi)\,[\pi - \bar{x}(\pi)]^2 + \sum_{\pi \in \Pi} \nu(\pi)\,[\bar{x}(\pi)(1 - \bar{x}(\pi))] . \tag{10.8} \]

The first summation on the right hand side of the above equation measures the
distance between the forecasts and the relative frequency of rainy days; that is, it is a
measure of the calibration of the forecaster. This component is zero if the forecast
is well calibrated. The second summation term is a measure of the refinement of
the forecaster and it shows that the score is improved with values of $\bar{x}(\pi)$ close
to 0 or 1. It can be approximately interpreted as a weighted average of the
bin-specific conditional variances, where bins are defined by the common forecast $\pi$.
This decomposition illustrates how the squared error loss combines elements of
calibration and forecasting, and is the counterpart of the bias/variance decomposition
we have seen in equation (7.19) with regard to parameter estimation under squared
error loss.
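The decomposition (10.8) can be checked numerically. The sketch below computes the Brier score and its two components from $\nu$ and $\bar{x}$; the function names and the two-bin example (a forecaster who announces 0.1 or 0.9 but sees rain on 20% and 70% of those days) are hypothetical.

```python
def brier_score(nu, xbar):
    """Brier score in the form given before equation (10.8)."""
    return sum(nu[p] * (xbar[p] * (p - 1) ** 2 + (1 - xbar[p]) * p ** 2) for p in nu)

def brier_components(nu, xbar):
    """Calibration and refinement terms on the right hand side of (10.8)."""
    calibration = sum(nu[p] * (p - xbar[p]) ** 2 for p in nu)
    refinement = sum(nu[p] * xbar[p] * (1 - xbar[p]) for p in nu)
    return calibration, refinement

nu = {0.1: 0.5, 0.9: 0.5}     # half of the days get each forecast
xbar = {0.1: 0.2, 0.9: 0.7}   # observed rain frequency within each bin
calibration, refinement = brier_components(nu, xbar)
```

The two components add up to `brier_score(nu, xbar)`, and replacing `xbar` by the forecasts themselves makes the calibration term vanish.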

DeGroot and Fienberg (1983) generalize this beyond squared error loss: suppose
that the forecaster’s subjective prediction is $\pi$ and that he or she is scored
on the basis of a strictly proper scoring rule specified by functions $s_1$ and $s_2$,
which are, respectively, decreasing and increasing functions of $\pi$. The forecaster’s
score is $s_1(\pi)$ if it rains and $s_2(\pi)$ otherwise. The next theorem tells us that
strictly proper scoring rules can be partitioned into calibration and refinement
terms.

Theorem 10.3 Suppose that the forecaster’s predictions are characterized by functions
$\bar{x}(\pi)$ and $\nu(\pi)$ with a strictly proper scoring rule specified by functions $s_1(\pi)$
and $s_2(\pi)$. The forecaster’s overall score is $S$ which can be decomposed as $S = S_1 + S_2$
where

\[ S_1 = \sum_{\pi \in \Pi} \nu(\pi) \left\{ \bar{x}(\pi)\,[s_1(\pi) - s_1(\bar{x}(\pi))] + [1 - \bar{x}(\pi)]\,[s_2(\pi) - s_2(\bar{x}(\pi))] \right\}, \]
\[ S_2 = \sum_{\pi \in \Pi} \nu(\pi)\, \varphi(\bar{x}(\pi)), \]

with $\varphi(p) = p\,s_1(p) + (1 - p)\,s_2(p)$, $0 \le p \le 1$, a strictly concave function.


Note that in the above decomposition, $S_1$ is a measure of the forecaster’s calibration
and achieves its minimum when $\bar{x}(\pi) = \pi$, $\forall\, \pi$; that is, when the forecaster is well
calibrated. On the other hand, $S_2$ is a measure of the forecaster’s refinement.
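To see the decomposition at work for a rule other than the quadratic one, the sketch below evaluates $S$, $S_1$, and $S_2$ for the logarithmic loss and, for comparison, the Brier loss. It assumes that the overall score is the frequency-weighted average $S = \sum_{\pi} \nu(\pi)[\bar{x}(\pi) s_1(\pi) + (1 - \bar{x}(\pi)) s_2(\pi)]$, which is the form the Brier score takes above; the function names and the numerical inputs are hypothetical.

```python
import math

def decompose(nu, xbar, s1, s2):
    """Return (S, S1, S2) for a scoring rule given by s1 (rain) and s2 (no rain)."""
    S = sum(nu[p] * (xbar[p] * s1(p) + (1 - xbar[p]) * s2(p)) for p in nu)
    S1 = sum(
        nu[p] * (xbar[p] * (s1(p) - s1(xbar[p]))
                 + (1 - xbar[p]) * (s2(p) - s2(xbar[p])))
        for p in nu
    )
    S2 = sum(nu[p] * (xbar[p] * s1(xbar[p]) + (1 - xbar[p]) * s2(xbar[p])) for p in nu)
    return S, S1, S2

def log_s1(p):      # logarithmic loss if it rains: decreasing in p
    return -math.log(p)

def log_s2(p):      # logarithmic loss if it does not rain: increasing in p
    return -math.log(1 - p)

def brier_s1(p):    # quadratic loss if it rains
    return (1 - p) ** 2

def brier_s2(p):    # quadratic loss if it does not rain
    return p ** 2

nu = {0.2: 0.5, 0.7: 0.5}
xbar = {0.2: 0.3, 0.7: 0.6}   # must lie strictly inside (0, 1) for the log rule
S_log, S1_log, S2_log = decompose(nu, xbar, log_s1, log_s2)
S_brier, S1_brier, S2_brier = decompose(nu, xbar, brier_s1, brier_s2)
```

For the Brier loss, $S_2$ reduces to the refinement term of (10.8); for the logarithmic loss, $\varphi$ is the entropy of a binary event with probability $\bar{x}(\pi)$.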

Although scoring rules can be used to compare probability assessments provided
by competing forecasters, or models, in practice it is more common to separately
assess calibration and refinement, or calibration and discrimination. Calibration is
typically assessed by the bias term in BS, or the chi-square statistic built on the
same two sets of frequencies. Discrimination is generally understood as the ability
of a set of forecasts to separate rainy from dry days and is commonly quantified
via the receiver operating characteristic (ROC) curve (Lusted 1971, McNeil et al.
1975). In general, a ROC curve quantifies the ability of a quantitative or ordinal
measurement or score (possibly a probability) to classify an associated binary
measurement.

For a biological example, consider a population of individuals, some of whom
have a binary genetic marker $x = 1$ and some of whom do not. A forecaster, or
prediction algorithm, gives you a measurement $\pi$ that has cumulative distribution

\[ F_x(\pi) = \sum_{\pi' \le \pi} \nu_x(\pi') \]

conditional on the true marker status $x$, where $\nu_x(\pi')$ is the proportion of individuals
with marker status $x$ who receive forecast $\pi'$. In a specific binary decision problem we may
be able to derive a cutoff point $\pi_0$ on the forecast and decide to “declare positive”
all individuals for whom $\pi > \pi_0$. This declaration will lead to true positives but
also some false positives. The fraction of individuals with the marker who are declared
positive, or sensitivity, is

\[ \beta(\pi_0) = \frac{\sum_{\pi > \pi_0} \nu(\pi)\, \bar{x}(\pi)}{\sum_{\pi \in \Pi} \nu(\pi)\, \bar{x}(\pi)} = 1 - F_1(\pi_0), \]

while the fraction of individuals without the marker who are declared negative, or
specificity, is

\[ \alpha(\pi_0) = \frac{\sum_{\pi \le \pi_0} \nu(\pi)\,[1 - \bar{x}(\pi)]}{\sum_{\pi \in \Pi} \nu(\pi)\,[1 - \bar{x}(\pi)]} = F_0(\pi_0) . \]

Generally a higher cutoff will decrease the sensitivity and increase the specificity,
similarly to what happens to the coordinates of the risk set as we vary the cutoff on
the test statistic in Section 7.5.1.
The ROC curve is a graph of $\beta(\pi_0)$ versus $1 - \alpha(\pi_0)$ as we vary $\pi_0$. Formally the
ROC curve is given by the equation

\[ \beta = 1 - F_1[F_0^{-1}(1 - t)], \]

where $t = 1 - \alpha$ is the fraction of false positives among individuals without the marker.

The overall discrimination of a classifier is often summarized across all possible
cutoffs by computing the area under the ROC curve, which turns out to
be equal to the probability that a randomly selected person with the genetic
marker has a forecast that is greater than that of a randomly selected person without the
marker.
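The sketch below traces an empirical ROC curve and compares the area under it, computed by the trapezoid rule, with the pairwise probability just described. Because forecasts here are discrete, ties can occur; we assume they are counted as one half, which is what makes the two numbers agree. The function names and the eight forecasts are hypothetical.

```python
def roc_points(scores_pos, scores_neg):
    """ROC points (1 - specificity, sensitivity), declaring positive when score > cutoff."""
    points = [(1.0, 1.0)]   # cutoff below every score: everyone is declared positive
    for cutoff in sorted(set(scores_pos) | set(scores_neg)):
        sensitivity = sum(s > cutoff for s in scores_pos) / len(scores_pos)
        false_pos = sum(s > cutoff for s in scores_neg) / len(scores_neg)
        points.append((false_pos, sensitivity))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(scores_pos, scores_neg):
    """P(carrier forecast > non-carrier forecast), counting ties as one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in scores_pos for q in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

carriers = [0.9, 0.8, 0.6, 0.6]        # hypothetical forecasts for x = 1
non_carriers = [0.7, 0.6, 0.3, 0.2]    # hypothetical forecasts for x = 0
area = auc_trapezoid(roc_points(carriers, non_carriers))
```

Raising the cutoff moves along the curve from the point (1, 1) towards (0, 0), trading sensitivity for specificity.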

The literature on measuring the quality of probability assessments and the utility 
of predictive assays in medicine is extensive. Pepe (2003) provides a broad overview. 
Issues with comparing multiple predictions are explored in Pencina et al. (2008). An 
interesting example of evaluating probabilistic prediction in the context of specific 
medical decisions is given by Gail (2008). 

10.4.2 Are Bayesians well calibrated? 

A coherent probability forecaster expects to be calibrated in the long run. Consider
a similar setting to the previous section and suppose in addition that, subsequent to
each forecast $\pi_i$, the forecaster receives feedback in the form of the outcome $x_i$ of the
predicted event. Assume that $\pi_{i+1}$ represents the posterior probability of rain given
the prior history $x_1, \ldots, x_i$, that is

\[ \pi_{i+1} = \pi(x_{i+1} \mid x_1, \ldots, x_i) . \]

One complication now is that we can no longer assume that the forecasts live in a
discrete set. One way to proceed is to select a subset, or subsequence, of days (for
example, all days with forecast in a given interval) and compare forecasts $\pi_i$ with
the proportion of rainy days in the subset. To formalize, let $e_i$ be the indicator for
whether day $i$ is included in the set. Define

\[ n_k = \sum_{i=1}^{k} e_i \]

to be the number of selected days within horizon $k$,

\[ \bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{k} e_i x_i \]

to be the relative frequency of rainy days within the subset, and

\[ \bar{\pi}_k = \frac{1}{n_k} \sum_{i=1}^{k} e_i \pi_i \]

as the average forecast probability within the tested subsequence. In this setting we
can define long-run calibration as follows:

Definition 10.6 Given a subsequence such that $e_i = 1$ infinitely often, the forecaster
is calibrated in the long run if

\[ \lim_{k \to \infty} (\bar{x}_k - \bar{\pi}_k) = 0 \]

with probability 1.

The condition that $e_i = 1$ infinitely often is required so we never run out of events to
compare and we can take the limit. To satisfy it we need to be careful in constructing
subsets using a condition that will keep coming up.

This is a good point to remind ourselves that, from the point of view of the
forecaster, the expected value of the forecaster’s posterior is his or her prior. For
example, at the beginning of the series

\[ E_{x_1}[\pi_2 \mid x_1] = \sum_{x=0}^{1} \pi(x_2 \mid x_1 = x)\, m(x) = \pi(x_2) = \pi_1 \]

and likewise at any stage in the process. This follows from the law of total probability,
as long as $m$ is the marginal distribution that is implied by the forecaster’s likelihood
and priors. So, as a corollary

\[ E_{x_i}[\pi_{i+1} \mid x_i] - \pi_i = 0, \]

a property that makes the sequence of forecasts a so-called martingale. So if the data
are generated from the very same stochastic model that is used in the updating step
(that is, Bayes’ rule) then the forecasts are expected to be stable. Of course this is a
big if.
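The martingale property can be verified exactly in a simple conjugate case. The sketch below assumes, purely for illustration, a forecaster who treats the days as exchangeable with a Beta($a$, $b$) prior on the chance of rain, so that the forecast after `rain` wet days out of `days` is $(a + \text{rain})/(a + b + \text{days})$. The function names are hypothetical.

```python
from fractions import Fraction as F

def forecast(a, b, rain, days):
    """Posterior predictive probability of rain after `rain` wet days out of `days`."""
    return F(a + rain, a + b + days)

def expected_next_forecast(a, b, rain, days):
    """Expectation of tomorrow's forecast under today's predictive distribution."""
    p = forecast(a, b, rain, days)   # marginal probability m(x = 1) of rain today
    wet = forecast(a, b, rain + 1, days + 1)
    dry = forecast(a, b, rain, days + 1)
    return p * wet + (1 - p) * dry

pi_1 = forecast(2, 3, 0, 0)                       # prior forecast with a = 2, b = 3
expected_pi_2 = expected_next_forecast(2, 3, 0, 0)
```

Averaging the two possible values of the next forecast with the forecaster’s own predictive weights returns the current forecast, whatever the history.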

In our setting, a consequence of this is that a coherent forecaster
who updates probabilities according to Bayes’ rule expects to be well calibrated
almost surely in the long run. This can be formalized in various ways. Details of
theorems and proofs are in Pratt (1962), Dawid (1982), and Seidenfeld (1985). The
good side of this result is that any coherent forecaster is not only internally consistent,
but also prepared to become consistent with evidence, at least with evidence
of the type he or she expects will accumulate. For example, if the forecaster’s
probabilities are based on a parametric model whose parameters are unknown and are
assigned a prior distribution, then, in the long run, if the data are indeed generated
by the postulated model, the forecaster will learn model parameters no matter what
the initial prior was, as long as a positive mass was assigned to the correct value.
Also, any two coherent forecasters that agree on the model will eventually give very
similar predictions.
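A simulation gives a feel for long-run calibration when the data really do come from the forecaster’s model. The sketch below is an illustration under assumed inputs: a Beta($a$, $b$) prior, a rain probability drawn from that prior, and a selection rule `select` that may look at the forecast but never at the outcome. It returns the gap $\bar{x}_k - \bar{\pi}_k$ of Definition 10.6. All names and defaults are hypothetical.

```python
import random

def calibration_gap(k, a=2.0, b=3.0, seed=0, select=lambda p: True):
    """Simulate k days and return xbar_k - pibar_k over the selected days."""
    rng = random.Random(seed)
    theta = rng.betavariate(a, b)        # data come from the forecaster's own model
    rain = 0                             # rainy days observed so far
    n_k = 0
    sum_x = 0
    sum_pi = 0.0
    for i in range(k):
        pi = (a + rain) / (a + b + i)    # forecast issued before day i is observed
        x = 1 if rng.random() < theta else 0
        if select(pi):                   # e_i depends on the forecast, never on x
            n_k += 1
            sum_x += x
            sum_pi += pi
        rain += x
    if n_k == 0:
        raise ValueError("no day was selected")
    return sum_x / n_k - sum_pi / n_k

gap = calibration_gap(20000)
```

With all days selected, the gap shrinks as the horizon grows. Passing, say, `select=lambda p: 0.3 <= p <= 0.5` restricts the comparison to days with forecasts in an interval, provided such forecasts keep occurring.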

Seidenfeld points out that this is a disappointment for those who may have hoped
for calibration to provide an additional criterion for restricting the possible range of
coherent probability specifications:

Subject to feedback, calibration in the long run is otiose. It gives no
ground for validating one coherent opinion over another as each coherent
forecaster is (almost) sure of his own long-run calibration. (Seidenfeld
1985, p. 274)

We are referring here to expected calibration by one’s own model. In reality
different forecasters will use different models and may never agree with each other or
with evidence. Dawid is concerned by the challenge this poses to the foundations of
Bayesian statistics:

Any application of the Theorem yields a statement of the form $\pi(A) = 1$,
where $A$ expresses some property of perfect calibration of the distribution
$\pi$. In practice, however, it is rare for probability forecasts to be well
calibrated (so far as can be judged from finite experience) and no realistic
forecaster would believe too strongly in his own calibration performance.

We have a paradox: an event can be distinguished... that is given
subjective probability one, and yet is not regarded as “morally certain”.

How can the theory of coherence, which is founded on assumptions of
rationality, allow for such irrational conclusions? (Dawid 1982, p. 607)

Whether this is indeed an irrational conclusion is a matter of debate, some of which 
can be enjoyed in the comments to Dawid’s paper, as well as in his rejoinder. 

Just as in our discussion of Savage’s “small worlds,” here we must keep in mind 
that the theory of rationality needs to be circumscribed to a small enough, reasonably 
realistic, microcosm, within which it is humanly possible to specify probabilities. 
For example, specifying a model for all possible future meteorological data in a 
broad sense, say including all new technological advances in measurement, climate 
change, and so forth, is beyond human possibility. So the “small world” may need 
to evolve with time, at the price of some reshaping of the probability model and a 
modicum of temporal incoherence. Morrie DeGroot used to say that he carried in
his pocket an $\epsilon$ of probability for complete surprises. He was coherent up to that $\epsilon$!
His $\epsilon$ would come in handy in the event that a long-held model turned out not to be
correct. Without it, or the willingness to understand coherence with some flexibility, 
we would be stuck for life with statistical models we choose here and now because 
they are useful, computable, or currently supported by a dominant scientific theory, 
but later become obsolete. How to learn statistical models from data, and how to try 
to do so rationally, is the matter of the next chapter. 


10.5 Exercises 

Problem 10.1 A forecaster must announce probabilities $q = (q_1, \ldots, q_J)$ for the
events $\theta_1, \ldots, \theta_J$. These events form a partition: that is, one and only one of them will
occur. The forecaster will be scored based on the scoring rule

\[ s(\theta_j, q) = \sum_{i=1}^{J} | q_i - 1_{i=j} | . \]

Here $1_{i=j}$ is 1 if $i = j$ and 0 otherwise. Let $\pi = (\pi_1, \ldots, \pi_J)$ represent the forecaster’s
own probability for the events $\theta_1, \ldots, \theta_J$. Show that this scoring rule is not proper.
That is, show that there exists a vector $q \ne \pi$ such that

\[ \sum_{j=1}^{J} \pi_j\, s(\theta_j, q) > \sum_{j=1}^{J} \pi_j\, s(\theta_j, \pi) . \]

Because you are looking for a counterexample, it is okay to consider a simplified
version of the problem, for example by picking a small $J$.

Problem 10.2 Show directly that the scoring rule $s(\theta_j, q) = q_j$ is not proper. See
Winkler (1969) for further discussion about the implications of this fact.

Problem 10.3 Suppose that $\theta_1$ and $\theta_2$ are indicators of disjoint events (that is,
$\theta_1 \theta_2 = 0$) and consider $\theta_3 = \theta_1 + \theta_2$ as the indicator of the union event. You
have to announce probabilities $q_{\theta_1}$, $q_{\theta_2}$, and $q_{\theta_3}$ to these events, respectively. You
will be scored according to the quadratic scoring rule:

\[ s(\theta_1, \theta_2, \theta_3, q_{\theta_1}, q_{\theta_2}, q_{\theta_3}) = (q_{\theta_1} - \theta_1)^2 + (q_{\theta_2} - \theta_2)^2 + (q_{\theta_3} - \theta_3)^2 . \]

Prove the additive law, that is $q_{\theta_3} = q_{\theta_1} + q_{\theta_2}$.

Hint: Calculate the scores for the occurrence of $\theta_1 (1 - \theta_2)$, $(1 - \theta_1)\theta_2$, and
$(1 - \theta_1)(1 - \theta_2)$.


Problem 10.4 Show that the area under the ROC curve is equal to the probability 
that a randomly selected person with the genetic marker has an assay that is greater 
than a randomly selected person without the marker. 

Problem 10.5 Prove Equation (10.8). 

Problem 10.6 Consider a sequence of independent binary events with probability
of success 0.4. Evaluate the two terms in equation (10.8) for the following four
forecasters:

Charles: always says 0.4.

Mary: randomly chooses between 0.3 and 0.5.

Qing: says either 0.2 or 0.3; when he says 0.2 it never rains, when he says 0.3 it
always rains.

Ana: follows this table:

               Rain    No rain
$\pi = 0.3$    0.15    0.35
$\pi = 0.5$    0.25    0.25

Comment on the calibration and refinement of these forecasters.

Choosing models


In this chapter we discuss model choice. So far we postulated a fixed data-generating
mechanism $f(x \mid \theta)$ without worrying about how $f$ is chosen. From this perspective,
we may think about model choice as the choice of which $f$ to use. Depending on
the application, $f$ could be a simple parametric family, or a more elaborate model.
George Box’s famous comment that “all models are wrong but some models are
useful” (Box 1979) highlights the importance of taking a pragmatic viewpoint in
evaluating models, and of setting criteria driven by the goals of the modeling. Decision
theory would seem to be the perfect perspective to formalize Box’s concise statement
of principle.

A view we could take is that of Chapter 10. Forecasters are incarnations of
predictive models that we can evaluate and compare based on utility functions such as
scoring rules. If we do this, we neatly separate the information that was used by
the forecasters to develop and tune the prediction models, from the information that
we use to evaluate them. This separation is a luxury we do not always have. More
often we would like to be able to entertain several approaches in parallel, and learn
something about how well they do directly from the data that are used to develop
them. Whether this is even possible is a matter of debate, and some hold, with good
reasons, that model training and model assessment should be separate.

But let us say we give in to the temptation of training and evaluating models at 
the same time. An important question is whether we can talk about a model as a 
“state of the world” in the same way we did for parameters or future events. Box’s 
comment that all models are wrong sounds like a negative answer. In the terminology 
of Savage, a model is perhaps like a “small world.” Within the small world we apply 
a theory that explains how we should learn from data and make good decisions. But 
can the theory tell us whether the “small world” is right? 

Both these considerations suggest that model choice may require a richer
conceptual framework than that of statistical decision theory. However, we can still
make a little bit of progress if we are willing to stipulate that a true model exists in
our list of candidate models. This is not a real change of perspective conceptually:
the “small world” is a little bit bigger, and the model is then simply another
parameter, but the results are helpful in clarifying the underpinning of some popular model
choice approaches.

In Section 11.1 we set up the general framework for decision problems in which
the model is unknown, and look at the implications of model selection and prediction.
Then, changing slightly the paradigm, we consider the situation in which only one
model is being contemplated and the question arises as to whether or not the model
may be adequate. We present a way in which decision theory can be brought
to bear for this problem in Section 11.2. We focus only on the Bayesian approach,
though frequentist model selection approaches are also available.

We do not have a featured article for this chapter. Useful general readings are 
Box (1980), Bernardo and Smith (1994), and Clyde and George (2004). 


11.1 The “true model” perspective 

11.1.1 Model probabilities 

Our discussion in most of this chapter is based on making our “small world” just a
little bit bigger so that unknowns that are not normally considered part of it are now
included. For example, we can imagine extending estimation of a population mean
from a known to an unknown family of distributions, or extending a linear
regression with two predictors to the bigger world in which any subset of five additional
predictors could also be included in the model. We will assume we can make a list
of possible models, and feel comfortable that one of the models is true, much as
we could be confident that one real number or another is the true average height of
a population. When we can do this, the model is part of the states of the world in
the usual sense, and nothing differentiates it from what we normally call $\theta$ except
habit. We can then apply all we know about decision theory, and handle the fact that
the model is unknown in a goal-driven way. This approach takes very seriously the
“some models are useful” part of Box’s aphorism, but it ignores the “all models are
wrong” part.

Formally, we can take our familiar $f(x \mid \theta)$ and consider it as a special case of
a larger collection of possible data-generating mechanisms defined by $f(x \mid \theta_M, M)$,
where $M$ denotes the model. We have a list of models, called $\mathcal{M}$. To fix ideas,
consider the original $f$ to be a normal with mean $\theta$ and variance 1. If you do not trust
the normal model but you are reasonably confident the distribution should be
symmetric, and your primary concerns are occasional outliers, then you can consider
the set of Student’s $t$ distributions with $M$ degrees of freedom and median $\theta$ to be
your new larger collection of models. If $\mathcal{M}$ is all positive integers, the normal case
is approached at $M = \infty$. $M$ is much like $\theta$ in that it is an unknown state of the
world. The different denomination reflects the common usage of the normal as a
fixed assumption (or model) in practical analyses.
In this example, the parameter $\theta$ can be interpreted as the population median
across all models. However, the interpretation of $\theta$, as well as its dimensionality,
may change across models in more complex examples. This is why we need to define
separate random variables $\theta_M$ for each $M$.

In general, we have a model index $M$, and as many parameter sets as there are
models, potentially. These are all unknown. The axiomatic foundations tell us that,
if the small world has not exploded on us yet, we should have a joint probability
distribution on the whole set. If we have a finite list of models, so $M$ is an integer
between 1 and $M_0$, the prior is

\[ \pi(\theta_1, \ldots, \theta_{M_0}, M) . \tag{11.1} \]

This induces a joint distribution over the data, parameters, and models. Implicit in the
specification of $f(x \mid \theta_M, M)$ is the idea that, conditional on $M$ and $\theta_M$, $x$ is independent
of all the $\theta_{M'}$ with $M' \ne M$. So the joint probability distribution on all unknowns can
be written as

\[ f(x, \theta_1, \ldots, \theta_{M_0}, M) = f(x \mid \theta_M, M)\, \pi(\theta_1, \ldots, \theta_{M_0}, M) . \]

This can be a very complicated distribution to specify, though for special decision
problems some simplifications take place. For example, if all models harbor a
common parameter $\theta$, and the loss function depends on the parameters only through $\theta$,
then we do not need to specify the horrendous-dimensional prior (11.1). We define
$\theta_M = (\theta, \eta_M)$, where $\eta_M$ are model-specific nuisance parameters. Then

\[
\begin{aligned}
\pi(\theta \mid x) &= \frac{1}{m(x)} \sum_{M=1}^{M_0} \int_{H_1} \cdots \int_{H_{M_0}} f(x \mid \theta, \eta_M, M)\, \pi(\theta, \eta_1, \ldots, \eta_{M_0}, M)\, d\eta_1 \cdots d\eta_{M_0} \\
&= \frac{1}{m(x)} \sum_{M=1}^{M_0} \int_{H_M} f(x \mid \theta, \eta_M, M)\, \pi(\theta, \eta_M, M)\, d\eta_M,
\end{aligned}
\tag{11.2}
\]

where $\pi(\theta, \eta_M, M)$ is the model-specific prior defined by

\[ \pi(\theta, \eta_M, M) = \int_{H_1} \cdots \int_{H_{M-1}} \int_{H_{M+1}} \cdots \int_{H_{M_0}} \pi(\theta, \eta_1, \ldots, \eta_{M_0}, M)\, d\eta_1 \cdots d\eta_{M-1}\, d\eta_{M+1} \cdots d\eta_{M_0} . \tag{11.3} \]

To get equation (11.2) reorder each of the $M_0$ integrals so that the integral with
respect to $\eta_M$ is on the outside.

In this case, because of Theorem 7.1, we can operate directly from the
distribution $\pi(\theta \mid x)$. Compared to working from prior (11.1), here we “only” need to specify
model-specific priors $\pi(\theta, \eta_M, M)$.

Another important quantity is the marginal distribution of the data given the
model. This will be the critical quantity to look at when the loss function
depends on the model but not specifically on the model parameters. An integration
similar to that leading to equation (11.2) will produce

\[ f(x \mid M) = \int f(x \mid \theta_M, M)\, \pi(\theta_M \mid M)\, d\theta_M \tag{11.4} \]

which again depends only on the model-specific conditional prior $\pi(\theta_M \mid M)$. Then,
conditional on the data $x$, the posterior model probabilities are given by

\[ \pi(M \mid x) = \frac{f(x \mid M)\, \pi(M)}{m(x)} \tag{11.5} \]

where $\pi(M)$ is the prior model probability implied by (11.1). Moreover, the posterior
predictive density for a new observation $\tilde{x}$ is given by

\[ f(\tilde{x} \mid x) = \sum_{M=1}^{M_0} f(\tilde{x} \mid x, M)\, \pi(M \mid x); \tag{11.6} \]

that is, a posterior weighted mixture of the conditional predictive distributions
$f(\tilde{x} \mid x, M)$.
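Equations (11.4) to (11.6) can be worked through exactly in a toy problem. The sketch below assumes two hypothetical models for a sequence of binary observations: under $M = 1$ the success probability is fixed at 1/2, and under $M = 2$ it has a uniform prior on (0, 1). For a sequence with $s$ successes in $n$ trials the marginal likelihoods are then $2^{-n}$ and $s!(n - s)!/(n + 1)!$. The function names are our own.

```python
from fractions import Fraction as F
from math import factorial

def marginal_likelihoods(s, n):
    """f(x | M) of equation (11.4) for a given sequence with s successes in n trials."""
    fair = F(1, 2 ** n)                                              # M = 1
    uniform = F(factorial(s) * factorial(n - s), factorial(n + 1))   # M = 2
    return {1: fair, 2: uniform}

def posterior_model_probs(prior, marginal):
    """Equation (11.5): pi(M | x) = f(x | M) pi(M) / m(x)."""
    m_x = sum(prior[M] * marginal[M] for M in prior)
    return {M: prior[M] * marginal[M] / m_x for M in prior}

def predictive_success_prob(s, n, post):
    """Equation (11.6) evaluated at a success on the next trial."""
    per_model = {1: F(1, 2), 2: F(s + 1, n + 2)}   # model-specific predictive probabilities
    return sum(per_model[M] * post[M] for M in post)

prior = {1: F(1, 2), 2: F(1, 2)}
post = posterior_model_probs(prior, marginal_likelihoods(8, 10))
p_next = predictive_success_prob(8, 10, post)
```

With 8 successes in 10 trials the data favor the second model, and the averaged predictive probability falls between the two model-specific values 1/2 and 3/4.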

The approach outlined here is a very elegant way of handling uncertainty about
which model to use: if the true model is unknown, use a set of models, and let the
data weigh in about which models are more likely to be the true one. In the rest
of this section we will look at the implications for model choice, prediction, and
estimation. Two major difficulties with this approach are the specification of prior
probabilities and the specification of a list of models large enough to contain the true
model, but small enough that prior probabilities can be meaningful. There are also
significant computational challenges. For more readings about the Bayesian
perspective on model uncertainty see Madigan and Raftery (1994), Draper (1995), Clyde and
George (2004), and references therein. Bernardo and Smith (1994) also discuss
alternative perspectives where the assignment of probabilities to the model space is not
logical, because it cannot be assumed that the true model is included in the list.


11.1.2 Model selection and Bayes factors 

In model choice the goal is to choose a single “best” model according to some
specified criterion. The simplest formulation is one in which we are interested in guessing
the correct model, and consider all mistakes equally undesirable. Within the
decision-theoretic framework the set of actions is $\mathcal{A} = \mathcal{M}$ and the loss function is 0
for choosing the true model and 1 otherwise. To simplify our discussion we assume
that $\mathcal{M}$ is a finite set, that is $\mathcal{M} = \{1, \ldots, M_0\}$. This loss function allows us to frame
the discussion in a decision-theoretic way and get some clear-cut results, but it is not
very much in the spirit of the Box aphorism: it really punts on the question of why the
model is useful. In this setting the prior expected loss for the action that declares $M$ to be
the true model is $1 - \pi(M)$ and the Bayes action $a^*$ is to choose the model with
highest prior probability. Similarly, after seeing observations $x$, the optimal decision is
to choose the model with highest posterior probability (11.5). Many model selection
procedures are motivated by the desire to approximate this property. However, often
they are used in practice to select a model and then perform inference or
prediction conditioning on the model. This practice is expedient and often necessary, but
a more consistent decision-theoretic approach would be to specify the loss function
directly in terms of the final use of the model. We will elaborate on this in the next
section.

Given any two models $M$ and $M'$,

\[ \frac{\pi(M \mid x)}{\pi(M' \mid x)} = \frac{\pi(M)}{\pi(M')} \cdot \frac{f(x \mid M)}{f(x \mid M')} ; \]

that is, the ratio between posterior probabilities for models $M$ and $M'$ is the product of
the prior odds ratio and the Bayes factor. We discussed Bayes factors in the context of
hypothesis testing in Chapter 7. As in hypothesis testing, the Bayes factor measures
the relative support for $M$ versus $M'$ as provided by the data $x$. Because of its relation
to posterior probabilities, choosing a model on the basis of the posterior probability
is equivalent to choosing a model using Bayes factors combined with the prior odds.
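The identity between posterior odds, prior odds, and the Bayes factor is simple to confirm numerically. In the sketch below the three prior probabilities and marginal likelihoods are made-up inputs, and the function names are hypothetical.

```python
from fractions import Fraction as F

def posterior_probs(prior, marginal):
    """Posterior model probabilities from prior probabilities and f(x | M)."""
    m_x = sum(prior[M] * marginal[M] for M in prior)
    return {M: prior[M] * marginal[M] / m_x for M in prior}

def posterior_odds(prior, marginal, M, M_alt):
    """Posterior odds of M versus M_alt as prior odds times the Bayes factor."""
    prior_odds = prior[M] / prior[M_alt]
    bayes_factor = marginal[M] / marginal[M_alt]
    return prior_odds * bayes_factor

prior = {1: F(1, 2), 2: F(3, 10), 3: F(1, 5)}          # hypothetical prior probabilities
marginal = {1: F(1, 100), 2: F(3, 100), 3: F(1, 50)}   # hypothetical values of f(x | M)
post = posterior_probs(prior, marginal)
best = max(post, key=post.get)   # Bayes action under 0-1 loss
```

Here model 2 starts with lower prior probability than model 1, but a Bayes factor of 3 in its favor is enough to reverse the ordering.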

In contrast to model selection where a true model is assumed and the utility
function explicitly seeks to find it, in model comparison we are simply
interested in quantifying the relative support that two models receive from the data.
The literature on model comparison is extensive and Bayes factors play an
important role. For an extensive discussion on Bayes factors in model comparison, see
Kass and Raftery (1995). Alternatively, the Bayesian information criterion (BIC)
(or Schwarz criterion) also provides a means for comparing models. The BIC is
defined as

\[ \mathrm{BIC}_M = 2 \log \sup_{\theta_M} f(x \mid \theta_M, M) - d \log n, \]

where $d$ is the number of parameters in model $M$ and $n$ is the sample size. A crude
approximation to the Bayes factor for comparing models $M$ and $M'$ is

\[ \exp\left[ \frac{1}{2} \left( \mathrm{BIC}_M - \mathrm{BIC}_{M'} \right) \right] . \]
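To get a sense of how crude the approximation is, the sketch below compares it with the exact Bayes factor in a toy problem where both are available. As an illustrative assumption, $M$ is a binomial model with a uniform prior on the success probability (one free parameter) and $M'$ fixes that probability at 1/2 (no free parameters). The function names are hypothetical.

```python
import math

def log_marginals(s, n):
    """Exact log f(x | M') and log f(x | M) for s successes in n trials."""
    fair = n * math.log(0.5)
    uniform = math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)
    return fair, uniform

def bics(s, n):
    """BIC of each model; requires 0 < s < n so that the logarithms are finite."""
    theta_hat = s / n
    max_loglik = s * math.log(theta_hat) + (n - s) * math.log(1 - theta_hat)
    bic_fair = 2 * n * math.log(0.5)             # d = 0 free parameters
    bic_uniform = 2 * max_loglik - math.log(n)   # d = 1 free parameter
    return bic_fair, bic_uniform

def bayes_factor_exact(s, n):
    fair, uniform = log_marginals(s, n)
    return math.exp(uniform - fair)

def bayes_factor_bic(s, n):
    bic_fair, bic_uniform = bics(s, n)
    return math.exp(0.5 * (bic_uniform - bic_fair))
```

Calling both functions with, say, 80 successes in 100 trials, or with 60 in 100, shows the approximation landing within a factor of two of the exact value in this example; the agreement need not be as close in other problems.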


Spiegelhalter et al. (2002) propose an alternative criterion known as the 
deviance information criterion for model checking and comparison, which can be 
applied to complex settings such as generalized linear models and hierarchical 
models. 


11.1.3 Model averaging for prediction and selection 

Let us now bring in more explicitly the typical goals of a statistical analysis. We
begin with prediction. Each model in our collection specifies a predictive density

\[ f(\tilde{x} \mid x, \theta_M, M) \]

for a future observation $\tilde{x}$. If the loss function depends only on our actions and $\tilde{x}$, then
model uncertainty is taken into account by model averaging. Similarly to the previous
section, we first compute the distribution $f(\tilde{x} \mid x)$, integrating out both models and
parameters, and then attack the decision problem by minimizing posterior expected
loss. For squared error loss, point predictions can be expressed as

\[ \delta^*(x) = E[\tilde{x} \mid x] = \sum_{M=1}^{M_0} E[\tilde{x} \mid x, M]\, \pi(M \mid x), \tag{11.7} \]

a weighted average of model-specific predictions, with weights given by posterior
model probabilities. If, instead, we were to decide on a predictive distribution, using
the negative log loss function of Section 10.3, the optimal predictive distribution
would be $f(\tilde{x} \mid x)$, which can also be expressed as a weighted average of model-specific
densities.
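Both forms of averaging can be sketched in a few lines. The example below assumes two hypothetical models whose predictive distributions are normal with different means, together with made-up posterior model probabilities; it computes the point prediction of (11.7) and the averaged predictive density.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def averaged_prediction(post, means):
    """Equation (11.7): posterior-weighted average of model-specific predictions."""
    return sum(post[M] * means[M] for M in post)

def averaged_density(x_new, post, means, sds):
    """Posterior-weighted mixture of model-specific predictive densities."""
    return sum(post[M] * normal_pdf(x_new, means[M], sds[M]) for M in post)

post = {1: 0.25, 2: 0.75}    # hypothetical posterior model probabilities
means = {1: 2.0, 2: 4.0}     # model-specific predictive means
sds = {1: 1.0, 2: 1.0}       # model-specific predictive standard deviations
point = averaged_prediction(post, means)
```

The mixture density is bimodal when the model-specific means are far apart, a feature that no single model in the list would reproduce, while its mean still equals the averaged point prediction.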

Consider now a slightly different case: the decision maker is uncertain about 
which model is true, but must make predictions based on a single model. This 
applies, for example, when the model predictions must be produced in a setting 
where the model averaging approach is not computationally feasible. Formally, the 
decision maker has to choose a model $M$ in $\{1, \ldots, M_0\}$ and, subsequently, make a
prediction for a future observation $\tilde{x}$ based on data $x$, assuming model $M$ is true.
We formalize this decision problem and denote by $\mathbf{a} = (a, b)$ the action where we
select model $a$ and use it to make a prediction $b$. For simplicity, we consider squared
error loss

$$L(\mathbf{a}, \tilde{x}) = (b - \tilde{x})^2.$$

This is a nested decision problem. The model affects the final loss through the constraints
it imposes on the prediction $b$. There is some similarity between this and the
multistage decision problems we will encounter in Part Three, though a key difference
here is that there is no additional data acquisition between the choice of the
model and the prediction.

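For any model $a$, the optimal prediction rule is

$$\delta^*_a(x) = \arg\min_b \int_{\mathcal{X}} (b - \tilde{x})^2 f(\tilde{x} \mid x, a) \, d\tilde{x} = E[\tilde{x} \mid x, a].$$

Plugging this solution back into the posterior expected loss function, the optimal
model $a^*$ minimizes

$$\mathcal{L}(a, \delta^*_a) = \int_{\mathcal{X}} (\delta^*_a(x) - \tilde{x})^2 f(\tilde{x} \mid x) \, d\tilde{x}.$$

Working on the integral above, and noting that the cross term vanishes because
$\delta^*_M(x) = E[\tilde{x} \mid x, M]$, we get

$$\begin{aligned}
\int_{\mathcal{X}} (\delta^*_a(x) - \tilde{x})^2 f(\tilde{x} \mid x) \, d\tilde{x}
&= \int_{\mathcal{X}} (\delta^*_a(x) - \tilde{x})^2 \left[ \sum_{M=1}^{M_0} \pi(M \mid x) f(\tilde{x} \mid x, M) \right] d\tilde{x} \\
&= \sum_{M=1}^{M_0} \pi(M \mid x) \int_{\mathcal{X}} \left[ (\delta^*_a(x) - \delta^*_M(x))^2 + (\delta^*_M(x) - \tilde{x})^2 \right] f(\tilde{x} \mid x, M) \, d\tilde{x} \\
&= \sum_{M=1}^{M_0} \pi(M \mid x) \left[ (\delta^*_a(x) - \delta^*_M(x))^2 + Var[\tilde{x} \mid M, x] \right].
\end{aligned}$$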


The above expression depends on model a only through the first element of the sum 
in square brackets. The optimal model minimizes the weighted difference between 
its own model-specific prediction and the predictions of the other possible models, 
with weights given by the posterior model probabilities. 

We can also compare this to $\delta^*(x) = \sum_{M=1}^{M_0} \delta^*_M(x) \pi(M \mid x)$, the posterior averaged
prediction. This is the global optimum when one is allowed to use predictions based
on all models rather than being constrained to using a single one, and coincides with
decision rule (11.7). Manipulating the first term of posterior expected loss above a
little further we get

$$\sum_{M=1}^{M_0} \pi(M \mid x) (\delta^*_a(x) - \delta^*_M(x))^2 = (\delta^*_a(x) - \delta^*(x))^2 + \sum_{M=1}^{M_0} (\delta^*_M(x) - \delta^*(x))^2 \pi(M \mid x).$$

Thus, the best model gives a prediction rule $\delta^*_a$ closest to the posterior averaged
prediction $\delta^*$.
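
The conclusion can be checked numerically. The sketch below, with hypothetical model-specific predictive means, variances, and posterior model probabilities, computes the posterior expected loss of each single-model choice and confirms that the minimizer is the model whose prediction is closest to the posterior averaged one.

```python
def single_model_expected_loss(a, means, variances, probs):
    """Posterior expected squared error loss of predicting with model a.

    Implements sum_M pi(M|x) [ (delta_a - delta_M)^2 + Var(x_tilde | M, x) ].
    """
    return sum(
        p * ((means[a] - m) ** 2 + v)
        for m, v, p in zip(means, variances, probs)
    )


# Hypothetical inputs: predictive means, predictive variances, pi(M | x).
means = [1.0, 2.0, 4.0]
variances = [1.0, 0.5, 2.0]
probs = [0.5, 0.3, 0.2]

losses = [single_model_expected_loss(a, means, variances, probs) for a in range(3)]
best_model = min(range(3), key=losses.__getitem__)

# Posterior averaged prediction and the model closest to it.
averaged = sum(p * m for p, m in zip(probs, means))
closest_model = min(range(3), key=lambda a: abs(means[a] - averaged))
```

The variance term is the same for every candidate model, which is why only the distance to the averaged prediction matters for the ranking.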

We now move to the case when the decision is about the entire prediction density.
Let $\mathbf{a} = (a, b)$ denote the action where we choose model $a$ and subsequently a density
$b$ as the predictive density for a future observation $\tilde{x}$. Let $L(\mathbf{a}, \tilde{x})$ denote the loss
function. We assume that such a loss function is a proper scoring rule, so that the optimal
choice for a predictive density is in fact the actual belief, that is $\delta^*_a(x) = f(\cdot \mid x, a)$.
Then the posterior expected loss of choosing model $a$ and proceeding optimally is

$$\mathcal{L}(a, \delta^*_a) = \int_{\mathcal{X}} \left[ \sum_{M=1}^{M_0} L(a, f(\cdot \mid x, a), \tilde{x}) f(\tilde{x} \mid x, M) \pi(M \mid x) \right] d\tilde{x}. \qquad (11.8)$$

The optimal strategy is to choose the model that minimizes equation (11.8). San
Martini and Spezzaferri (1984) consider the logarithmic loss function corresponding
to the scoring rule of Section 10.3. In our context, this implies that model $M$ is
preferred to model $M'$ iff

$$\int_{\mathcal{X}} \sum_{M''=1}^{M_0} \log \left( \frac{f(\tilde{x} \mid x, M)}{f(\tilde{x} \mid x, M')} \right) f(\tilde{x} \mid x, M'') \pi(M'' \mid x) \, d\tilde{x} > 0.$$

San Martini and Spezzaferri (1984) further develop this choice criterion in the
case of two nested linear regression models, and under additional assumptions
regarding prior distributions observe that the criterion takes the form


$$LR - k(d_M - d_{M'}), \qquad (11.9)$$

where $LR$ is the likelihood ratio statistic, $d_M$ is the number of regression parameters
in model $M$, and



with $n$ denoting the number of observations. Equation (11.9) resembles the comparison
resulting from the AIC (Akaike 1973) for which $k = 2$ and the BIC (Schwarz
1978) where $k = \log(n)$. Poskitt (1987) provides another decision-theoretic development
for a model selection criterion that resembles the BIC, assuming a continuous
and bounded utility function.
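
To make the resemblance concrete, here is a small sketch of a comparison of the form (11.9) with the AIC and BIC values of $k$. It assumes, as a convention not spelled out above, that the likelihood ratio statistic is twice the difference in maximized log-likelihoods and that positive values favor model $M$. The log-likelihood values and dimensions are hypothetical.

```python
import math


def penalized_lr(loglik_m, loglik_m_prime, d_m, d_m_prime, k):
    """LR - k (d_M - d_M'), with LR = 2 (loglik_M - loglik_M') by assumption."""
    lr = 2.0 * (loglik_m - loglik_m_prime)
    return lr - k * (d_m - d_m_prime)


# Hypothetical fits: M has 5 regression parameters, M' has 3.
loglik_m, loglik_m_prime = -100.0, -104.0
d_m, d_m_prime = 5, 3

aic_comparison = penalized_lr(loglik_m, loglik_m_prime, d_m, d_m_prime, k=2.0)
bic_comparison_n50 = penalized_lr(loglik_m, loglik_m_prime, d_m, d_m_prime, k=math.log(50))
bic_comparison_n200 = penalized_lr(loglik_m, loglik_m_prime, d_m, d_m_prime, k=math.log(200))
```

Under this convention the AIC-type comparison equals the difference between the AIC of $M'$ and that of $M$. The BIC-type penalty grows with $n$, so the same likelihood ratio can favor the larger model at one sample size and the smaller model at another.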


11.2 Model elaborations 


So far our approach to questioning our “small world” has been to make it bigger and 
deal with the new unknowns according to doctrine. In some settings, this approach 
also gives us guidance on how to make the world small again—by focusing back on 
a single model, perhaps different from the one we started out with. Here we briefly 
consider a different perspective, which is based on looking at small perturbations 
(or “elaborations”) of the small world that are designed to explore whether bigger 
worlds are likely to change our behavior substantially, without actually having to 
build a complete probabilistic representation for those. 

In more statistical language, we are interested in model criticism—we plan by 
default to consider a specific model and we wish to get a sense for whether this 
choice may be inadequate. In the vast majority of applications this task is addressed 
by a combination of significance testing, typically for goodness of fit of the model, 
and exploratory data analysis, for example examination of residuals. For a great
discussion and an entry point to the extensive literature see Box (1980). In this paper,
Box describes scientific learning as “an iterative process consisting of Criticism and 
Estimation,” and holds that “sampling theory is needed for exploration and ultimate 
criticism of entertained models in the light of data, while Bayes’ theory is needed for 
estimation.” An interesting related discussion is in Gelman et al. (1996). 

While we do find much wisdom in Box’s comment, in this section, we look at 
how traditional Bayesian decision theory can be harnessed to criticize a model, and 
also we revisit traditional criticism metrics from a decision perspective. In our
discussion, model M will be the model initially proposed and the question is whether or
not the decision maker should embark on more complex modeling before carrying
out the decision analysis. A simple approach, reflecting some statistical practice, is to
set up, for probing purposes, a second model M' and compare it to M. Bernardo and
Smith (1994) propose to choose an appropriate loss function for model evaluation,
and look at the change in posterior expected loss as an indication of the worthiness
of M' compared to M. A related approach is described in Carota et al. (1996) who
use model elaboration to estimate the change in utility resulting from a larger model
in a neighborhood of M.

If the current model holds, the joint density of the observed data $x$ and unobserved
parameters $\theta$ is

$$f(x, \theta \mid M) = f(x \mid \theta, M) \pi(\theta \mid M).$$

To evaluate model $M$ we embed it in a class of models $\mathcal{M}$, called a model elaboration
(see Box and Tiao 1973, Smith 1983, West 1992). For example, suppose that in the
current model data are exponential. To elaborate on this model we may consider $\mathcal{M}$
to be the family of Weibull distributions. The parameter $\phi$ will index models in $\mathcal{M}$
so that

$$f(x, \theta, \phi \mid \mathcal{M}) = f(x \mid \theta, \phi, \mathcal{M}) \pi(\theta \mid \phi, \mathcal{M}) \pi(\phi \mid \mathcal{M}).$$

The idea of an elaboration is that the original model $M$ is still a member of $\mathcal{M}$ for
some specific value $\phi_M$ of the elaboration parameter. In the model criticism situation,
the prior distribution $\pi(\phi \mid \mathcal{M})$ is concentrated around $\phi_M$ to reflect the initial
assumption that $M$ is the default model. One way to think of this prior is that it
provides a formalization of DeGroot’s pocket $\epsilon$ from Section 10.4.2.

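To illustrate, let $y \mid \theta, M \sim N(\theta, \sigma^2/n)$, and $\theta \mid M \sim N(\mu_0, \tau_0^2)$ where $(\sigma^2, \tau_0^2)$ are
both known. A useful elaboration $\mathcal{M}$ is defined by

$$\begin{aligned}
y \mid \theta, \phi, \mathcal{M} &\sim N(\theta, \sigma^2/(n\phi)) \qquad (11.10) \\
\theta \mid \phi, \mathcal{M} &\sim N(\mu_0, \tau^2/\phi) \\
\phi \mid \mathcal{M} &\sim \mbox{Gamma}(\nu/2, 1/(2\tau^2)).
\end{aligned}$$

The elaboration parameter $\phi$ corresponds to a variance inflation factor, and $\phi_M = 1$.
West (1985) and Efron (1986) show how this generalizes to the case where $M$ is an
exponential family.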

For another illustration, a general way to connect two nonnested models $M$ and
$M'$ defined by densities $f(x \mid M)$ and $g(x \mid M')$ is the elaboration

$$x \mid \phi, \mathcal{M} \sim c(\phi) f(x \mid M)^{\phi} g(x \mid M')^{1 - \phi}$$

where $c(\phi)$ is an appropriate normalization function. See Cox (1962) for further
discussion.

Given any elaboration, model criticism can be carried out by comparing the original
and elaborated posterior expected losses, by comparing the posteriors $\pi(\theta \mid x, M)$
to $\pi(\theta \mid x, \phi, \mathcal{M})$, or, lastly, by comparing $\pi(\phi \mid x, \mathcal{M})$ to $\pi(\phi \mid \mathcal{M})$. Carota et al. (1996)
consider the latter for defining a criticism measure, and define a loss function for
capturing the distance between the two distributions. Even though this loss is not
the actual terminal loss of the problem we started out with, if the data change the
marginal posterior of $\phi$ by a large amount, it is likely that model $M$ will perform
poorly for the purpose of the original loss as well. To simplify notation, we drop $\mathcal{M}$
in the above distributions.

Motivated by the logarithmic loss function of Section 10.3, we define the
diagnostic measure as

$$\Delta = E_{\phi \mid x} \left[ \log \left( \frac{\pi(\phi \mid x)}{\pi(\phi)} \right) \right]. \qquad (11.11)$$

$\Delta$ is the Kullback-Leibler divergence between the prior and posterior distributions
of $\phi$. In a decision problem where the goal is to choose a probability distribution on
$\phi$, and the utility function is logarithmic, it measures the change in expected utility
attributable to observing the data. We will return to this point in Chapter 13 in our
discussion of the Lindley information.

Low values of $\Delta$ indicate agreement between prior and posterior distributions
of the model elaboration parameter $\phi$, validating the current model $M$. Interpreting
high values of $\Delta$ is trickier, though, and may require the investigation of $\pi(\phi \mid x)$.
If the value of $\Delta$ is large and $\pi(\phi \mid x)$ is peaked around $\phi_M$, the value for which the
elaborated model is equal to $M$, then model $M$ is adequate. Otherwise, it indicates
that model $M$ is inappropriate.
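
The following sketch illustrates how the diagnostic behaves in a simple conjugate case. It assumes a normal location elaboration, $x \mid \phi \sim N(\theta_0 + \phi, 1/n)$ with prior $\phi \sim N(0, \tau^2)$, so that the current model corresponds to $\phi = 0$. The helper names and numerical settings are hypothetical. The closed form is the standard Kullback-Leibler divergence between two normal distributions, and a plain midpoint quadrature is included as an independent check.

```python
import math


def normal_logpdf(z, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def posterior_params(x, theta0, n, tau2):
    """Posterior mean and variance of phi in the assumed normal elaboration."""
    post_var = tau2 / (n * tau2 + 1.0)
    post_mean = n * tau2 * (x - theta0) / (n * tau2 + 1.0)
    return post_mean, post_var


def delta_closed_form(x, theta0, n, tau2):
    """KL divergence of the N(0, tau2) prior from the normal posterior of phi."""
    m, v = posterior_params(x, theta0, n, tau2)
    return 0.5 * (math.log(tau2 / v) + (v + m * m) / tau2 - 1.0)


def delta_quadrature(x, theta0, n, tau2, grid=20000, width=10.0):
    """Midpoint-rule approximation of E_{phi|x}[log pi(phi|x) - log pi(phi)]."""
    m, v = posterior_params(x, theta0, n, tau2)
    sd = math.sqrt(v)
    lower = m - width * sd
    step = 2.0 * width * sd / grid
    total = 0.0
    for i in range(grid):
        phi = lower + (i + 0.5) * step
        log_post = normal_logpdf(phi, m, v)
        total += math.exp(log_post) * (log_post - normal_logpdf(phi, 0.0, tau2)) * step
    return total


delta_near = delta_closed_form(x=0.2, theta0=0.0, n=10, tau2=1.0)
delta_far = delta_closed_form(x=1.5, theta0=0.0, n=10, tau2=1.0)
delta_check = delta_quadrature(x=0.2, theta0=0.0, n=10, tau2=1.0)
```

In this setting the diagnostic is positive even when $x$ equals $\theta_0$, because the data sharpen the posterior around $\phi = 0$. This is the situation described above in which a large value of the diagnostic must be read together with the location of the posterior.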

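Direct evaluation of equation (11.11) is often difficult. As an alternative, the function
$\Delta$ can be computed either by an approximation of the prior and the posterior,
leading to analytical expressions for $\Delta$, or by using a Monte Carlo approach (Müller
and Parmigiani 1996). Another possibility is to consider a linearized diagnostic measure
$\Delta_L$ which approximates $\Delta$ when the prior on $\phi$ is peaked around $\phi_M$. To derive
the linearized diagnostic $\Delta_L$, observe that, since $\pi(\phi \mid x) = f(x \mid \phi) \pi(\phi) / m(x)$,

$$\Delta = E_{\phi \mid x} \left[ \log \left( \frac{f(x \mid \phi)}{m(x)} \right) \right].$$

Now, expanding $\log f(x \mid \phi)$ about $\phi_M$ we have

$$\log f(x \mid \phi) = \log f(x \mid \phi_M) + (\phi - \phi_M) \left. \frac{\partial}{\partial \phi} \log f(x \mid \phi) \right|_{\phi = \phi_M} + R(\phi)$$

for some remainder function $R(\cdot)$. The linearized version $\Delta_L$ is defined as

$$\Delta_L = \log \frac{f(x \mid \phi_M)}{m(x)} + E_{\phi \mid x}(\phi - \phi_M) \left. \frac{\partial}{\partial \phi} \log f(x \mid \phi) \right|_{\phi = \phi_M}. \qquad (11.12)$$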




$\Delta_L$ combines three elements, all of which are relevant model criticism statistics in
their own right: the Savage density ratio defined by

$$\frac{f(x \mid \phi_M)}{m(x)};$$

the posterior expected value of $\phi - \phi_M$; and the marginal score function
$(\partial / \partial \phi) \log f(x \mid \phi)$. The Savage density ratio is equivalent, under certain conditions, to
the Bayes factor for the null hypothesis that $\phi = \phi_M$ against the family of alternatives
defined by the elaboration. However, in the diagnostic context, the Bayes factor is a
sufficient summary of the data only when the loss function assigns the same
penalty to all incorrect models or when the elaboration is binary. The diagnostic
approach differs from a model choice analysis based on a Bayes factor as both $\Delta$
and $\Delta_L$ incorporate a penalty for the severity of the departure in the utility function.
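
The three ingredients of the linearized diagnostic can be computed separately and then combined. The sketch below does so under an assumed normal location elaboration, $x \mid \phi \sim N(\theta_0 + \phi, 1/n)$ with $\phi \sim N(0, \tau^2)$ and $\phi_M = 0$. The function name and the numerical inputs are hypothetical.

```python
import math


def normal_logpdf(z, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def linearized_diagnostic(x, theta0, n, tau2):
    """Return the linearized diagnostic and its three components."""
    phi_m = 0.0
    # Log Savage density ratio: log f(x | phi_M) - log m(x), where the
    # marginal m(x) is N(theta0, 1/n + tau2) in this conjugate setting.
    log_savage = (
        normal_logpdf(x, theta0 + phi_m, 1.0 / n)
        - normal_logpdf(x, theta0, 1.0 / n + tau2)
    )
    # Posterior expected value of phi - phi_M.
    post_shift = n * tau2 * (x - theta0) / (n * tau2 + 1.0) - phi_m
    # Score: derivative of log f(x | phi) with respect to phi at phi_M.
    score = n * (x - theta0 - phi_m)
    return log_savage + post_shift * score, (log_savage, post_shift, score)


delta_l, (log_savage, post_shift, score) = linearized_diagnostic(
    x=0.2, theta0=0.0, n=10, tau2=1.0
)
```

Keeping the components separate makes it easy to see which of them drives a large value: a small Savage density ratio, a posterior that has moved away from $\phi_M$, or a steep score.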


11.3 Exercises 


Problem 11.1 In the context of Section 11.2 suppose that $M$ is such that $x \mid M \sim
N(\theta_0, 1/n)$ where $\theta_0$ is a known mean. Consider the elaborated model $\mathcal{M}$ given by
$x \mid \phi, \mathcal{M} \sim N(\theta_0 + \phi, 1/n)$, and $\phi \mid \mathcal{M} \sim N(0, \tau^2)$. Show that the diagnostic $\Delta$ is

$$\Delta = \frac{n^2 \tau^2 (x - \theta_0)^2 - n \tau^2 (n \tau^2 + 1)}{2 (n \tau^2 + 1)^2} + \frac{1}{2} \log(n \tau^2 + 1)$$

and the linearized diagnostic $\Delta_L$ is

$$\Delta_L = \frac{n^2 \tau^2}{2 (n \tau^2 + 1)} (x - \theta_0)^2 + \frac{1}{2} \log(n \tau^2 + 1).$$

Comment on the strength and limitations of these as metrics for model criticism.


Dynamic programming


In this chapter we begin our discussion of multistage decision problems, where multiple
decisions have to be made over time and with varying degrees of information. The
salient aspect of multistage decision making is that, like in a chess game, decisions
made now affect the worthiness of options available in the future and sometimes also
the information available when making future decisions. In statistical practice, multistage
problems can be used to provide a decision-theoretic foundation to the design
of experiments, in which early decisions are concerned with which data to collect,
and later decisions with how to use the information obtained. Chapters 13, 14, and
15 consider multistage statistical decisions in some detail. In this chapter we will be
concerned with the general principles underlying multistage decisions for expected
utility maximizers.

Finite multistage problems can be represented by decision trees. A decision tree
is a graphical representation that allows us to visualize a large and complex decision
problem by breaking it into smaller and simpler decision problems. In this chapter,
we illustrate the use of decision trees in the travel insurance example of Section 7.3
and then present a general solution approach to two-stage (Section 12.3.1) and multistage
(Section 12.3.2) decision trees. Examples are provided in Sections 12.4.1
through 12.5.2. While conceptually general and powerful, the techniques we will
present are not easy to implement: we will discuss some of the computational
issues in Section 12.7.

The main principle used in solving multistage decision trees is called backwards 
induction and it emerged in the 1950s primarily through the work of Bellman (1957) 
on dynamic programming. 

Featured book (chapter 3): 

Bellman, R. E. (1957). Dynamic programming, Princeton University Press.

Our discussion is based on Raiffa and Schlaifer (1961). Additional useful references
are DeGroot (1970), Lindley (1985), French (1988), Bather (2000), and
Bernardo and Smith (2000).

12.1 History 

Industrial process control has been one of the initial motivating applications for 
dynamic programming algorithms. Nemhauser illustrates the idea as follows: 

For example, consider a chemical process consisting of a heater, reactor 
and distillation tower connected in series. It is desired to determine the 
optimal temperature in the heater, the optimal reaction rate, and the optimal
number of trays in the distillation tower. All of these decisions are
interdependent. However, whatever temperature and reactor rate are chosen,
the number of trays must be optimal with respect to the output from
the reactor. Using this principle, we may say that the optimal number of 
trays is determined as a function of the reactor output. Since we do not 
know the optimal temperature or reaction rate yet, the optimal number 
of trays and return from the tower must be found for all feasible reactor 
outputs. 

Continuing sequentially, we may say that, whatever temperature is 
chosen, the reactor rate and number of trays must be optimal with respect 
to the heater output. To choose the best reaction rate as a function of 
the heater output, we must account for the dependence of the distillation 
tower of the reactor output. But we already know the optimal return from 
the tower as a function of the reactor output. Hence, the optimal reaction 
rate can be determined as a function of the reactor input, by optimizing 
the reactor together with the optimal return from the tower as a function 
of the reactor output. 

In making decisions sequentially as a function of the preceding decisions,
the first step is to determine the number of trays as a function of the
reactor output. Then, the optimal reaction rate is established as a function 
of the input to the reactor. Finally, the optimal temperature is determined 
as a function of the input to the heater. Finding a decision function, we 
can optimize the chemical process one stage at a time. (Nemhauser 1966, 
pp. 6-7) 

The technique just described to solve multidimensional decision problems is
called dynamic programming, and is based on the principle of backwards induction.
Backwards induction has its roots in the work of Arrow et al. (1949) on optimal
stopping problems and that of Wald (1950) on sequential decision theory. In optimal
stopping problems one has the option to collect data sequentially. At each decision
node two options are available: either continue sampling, or stop sampling and take a
terminal action. We will discuss this class of problems in some detail in Chapter 15.
Richard Bellman’s research on a large class of sequential problems in the early 1950s
led to the first book on the subject (Bellman 1957) (see also Bellman and Dreyfus
1962). Bellman also coined the term dynamic programming:

The problems we treat are programming problems, to use a terminology
now popular. The adjective “dynamic”, however, indicates that we
are interested in processes in which time plays a significant role, and in 
which the order of operations may be crucial. (Bellman 1957, p. xi) 

A far more colorful description of the political motivation behind this choice is 
reported in Bellman’s autobiography (Bellman 1984), as well as in Dreyfus (2002). 

The expression backward induction comes from the fact that the sequence of 
decisions is solved by reversing their order in time, as illustrated with the chemical 
example by Nemhauser (1966). Lindley explains the inductive technique as follows: 

For the expected utility required is that of going on and then doing the 
best possible from then onwards. Consequently in order to find the best 
decision now ... it is necessary to know the best decision in the future. 

In other words the natural time order of working from the present to 
the future is not of any use because the present optimum involves the 
future optimum. The only method is to work backwards in time: from the 
optimum future behaviour to deduce the optimum present behaviour, and 
so on back into the past. ... The whole of the future must be considered 
in deciding whether to go on. (Lindley 1961, pp. 42-43) 

The dynamic programming method allows us to conceptualize and solve problems
that would be far less tractable if each possible decision function, which
depends on data and decisions that accumulate sequentially, had to be considered 
explicitly: 

In the conventional formulation, we consider the entire multi-stage decision
process as essentially one stage, at the expense of vastly increasing
the dimension of the problem. Thus, if we have an N-stage process
where M decisions are to be made at each stage, the classical approach
envisages an MN-dimensional single-stage process. ... [I]n place of
determining the optimal sequence of decisions from some fixed state of
the system, we wish to determine the optimal decision to be made at any
state of the system. ... The mathematical advantage of this formulation
lies first of all in the fact that it reduces the dimension of the process to
its proper level, namely the dimension of the decision which confronts
one at any particular stage. This makes the problem analytically more
tractable and computationally vastly simpler. Secondly, ... it furnishes us
with a type of approximation which has a unique mathematical property,
that of monotonicity of convergence, and is well suited to applications,
namely, “approximation in policy space”. (Bellman 1957, p. xi)

Roughly speaking, the technique allows one to transform a multistage decision
problem into a series of one-stage decision problems and thus make decisions one at
a time. This relies on the principle of optimality stated by Bellman:

An optimal policy has the property that whatever the initial state and 
initial decision are, the remaining decisions must constitute an optimal 
policy with regard to the state resulting from the first decision. (Bellman 
1957, p. 82) 

This ultimately allows for computational advantages as explained by Nemhauser: 

We may say that a problem with N decision variables can be transformed 
into N subproblems, each containing only one decision variable. As a 
rule of thumb, the computations increase exponentially with the number
of variables, but only linearly with the number of subproblems. Thus
there can be great computational savings. Often this savings makes the
difference between an insolvable problem and one requiring only a small
amount of computer time. (Nemhauser 1966, p. 6)
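
A toy comparison can make the rule of thumb tangible. The sketch below uses an entirely hypothetical additive cost in which each decision interacts only with the previous one. It solves the same problem by enumerating all decision sequences and by backward recursion, counting how much work each approach does. It is meant as an illustration of the principle of optimality and not as a reconstruction of Nemhauser's chemical process example.

```python
from itertools import product


def stage_cost(t, prev, cur):
    """Hypothetical cost of choosing `cur` at stage t after choosing `prev`."""
    return (cur - prev) ** 2 + ((t + 1) * cur) % 3


def brute_force(n_stages, n_choices, start=0):
    """Evaluate every complete sequence of decisions."""
    best, evaluated = None, 0
    for sequence in product(range(n_choices), repeat=n_stages):
        evaluated += 1
        prev, total = start, 0
        for t, cur in enumerate(sequence):
            total += stage_cost(t, prev, cur)
            prev = cur
        if best is None or total < best:
            best = total
    return best, evaluated


def backward_recursion(n_stages, n_choices, start=0):
    """Solve one-stage problems from the last stage back to the first."""
    # value[s]: minimal cost of the remaining stages if the last decision was s.
    value = [0] * n_choices
    solved = 0
    for t in reversed(range(n_stages)):
        new_value = []
        for prev in range(n_choices):
            solved += 1  # one single-stage minimization per stage and state
            new_value.append(
                min(stage_cost(t, prev, cur) + value[cur] for cur in range(n_choices))
            )
        value = new_value
    return value[start], solved


brute_best, sequences_evaluated = brute_force(n_stages=6, n_choices=4)
dp_best, subproblems_solved = backward_recursion(n_stages=6, n_choices=4)
```

With six stages and four choices per stage, enumeration visits every one of the 4096 sequences, while the recursion solves 24 one-stage problems and reaches the same optimum.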

Dynamic programming algorithms have found applications in engineering, economics,
medicine, and most recently computational biology, where they are used to
optimally align similar biological sequences (Ewens and Grant 2001). 

12.2 The travel insurance example revisited 

We begin our discussion of dynamic programming techniques by revisiting the travel 
insurance example of Section 7.3. In that section we considered how to optimally use 
the information about the medical test. We are now going to consider the sequential 
decision problem in which, in the first stage, you have to decide whether or not 
to take the test, and in the second stage you have to decide whether or not to buy 
insurance. This is an example of a two-stage sequential decision problem. Two-stage 
or, more generally, multistage decision problems can be represented by decision trees 
by introducing additional decision nodes and chance nodes. Figure 12.1 shows the 
decision tree for this sequential problem. 

in describing the tree from left to right the natural order of the events in 
time has been followed, so that at any point of the tree the past lies to our 
left and can be studied by pursuing the branches down to the trunk, and 
the future lies to our right along the branches springing from that point 
and leading to the tips of the tree. (Lindley 1985, p. 141) 

In this example we use the loss function in the original formulation given in 
Table 7.6. However, as previously discussed, the Bayesian solution would be the 
same should we use regret losses instead, and this still applies in the multistage case. 

Figure 12.1 Decision tree for the two-stage travel insurance example.

We are going to work within the expected utility principle, and also adhere to the
before/after axiom across all stages of decisions. So we condition on information as
it accrues, and we revise our probability distribution on the states of the world using
the Bayes rule at each stage. In this example, to calculate the Bayes strategy we need
to obtain the posterior probabilities of becoming ill given each of the possible values
of $x$, that is

$$\pi(\theta \mid x) = \frac{f(x \mid \theta) \pi(\theta)}{m(x)}$$

for $\theta = \theta_1, \theta_2$ and $x = 0, 1$. Here

$$m(x) = \sum_{k=1}^{2} f(x \mid \theta_k) \pi(\theta_k)$$

is the marginal distribution of $x$. We get

$$m(x = 1) = 0.250$$
$$\pi(\theta_1 \mid x = 0) = 0.004$$
$$\pi(\theta_1 \mid x = 1) = 0.108.$$

Hence, we can calculate the posterior expected losses for rules $\delta_1$ and $\delta_2$ given
$x = 0$ as

$\delta_1$: Expected loss = 1000 × 0.004 + 0 × 0.996 = 4
$\delta_2$: Expected loss = 50 × 0.004 + 50 × 0.996 = 50

and given $x = 1$ as

$\delta_1$: Expected loss = 50 × 0.108 + 50 × 0.892 = 50
$\delta_2$: Expected loss = 1000 × 0.108 + 0 × 0.892 = 108.

The expected losses for rules $\delta_0$ and $\delta_3$ remain as they were, because those rules
do not depend on the data. Thus, if the test is performed and it turns out positive, 
then the optimal decision is to buy the insurance and the expected loss of this action 
is 50. If the test is negative, then the optimal decision is not to buy the insurance 
with expected loss 4. Incidentally, we knew this from Section 7.3, because we had 
calculated the Bayes rule by minimizing the Bayes risk. Here we verified that we 
get the same result by minimizing the posterior expected loss for each point in the 
sample space. 

Now we can turn to the problem of whether or not to get tested. When no test is 
performed, the optimal solution is not to buy the insurance, and that has an expected 
loss of 30. When a test is performed, we calculate the expected losses associated with 
the decision in the first stage from this perspective: we can evaluate what happens if 
the test is positive and we proceed according to the optimal strategy thereafter. Similarly,
we can evaluate what happens if the test is negative and we proceed according
to the optimal strategy thereafter. So what we expect to happen is a weighted average
of the two optimal expected losses, each conditional on one of the possible outcomes
of $x$. The weights are the probabilities of the outcomes of $x$ at the present time.
Accordingly, the expected loss if the test is chosen is

$$50 \times m(x = 1) + 4 \times m(x = 0) = 50 \times 0.25 + 4 \times 0.75 = 15.5.$$

Comparing this to 30, the value we get if we do not test, we conclude that the optimal 
decision is to test. This is reassuring: the test is free (so far) and the information 
provided by the test would contribute to our decision, so it is logical that the optimum 
at the first stage is to acquire the information. Overall, the optimal sequential strategy 
is as follows. You should take the test. If the test is positive, then buy the medical 
insurance. If the test is negative, then you should not buy the medical insurance. The 
complete analysis of this decision problem is summarized in Figure 12.2. 
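
The two-stage calculation can be retraced in a few lines of Python. The sketch below uses only the losses, posterior probabilities, and marginal probabilities quoted in this section. The dictionary layout and function names are hypothetical choices made for the illustration.

```python
# Losses: 1000 if ill and uninsured, 0 if well and uninsured,
# 50 if insured regardless of health.
LOSS = {
    ("do not buy", "ill"): 1000.0,
    ("do not buy", "well"): 0.0,
    ("buy", "ill"): 50.0,
    ("buy", "well"): 50.0,
}
POSTERIOR_ILL = {0: 0.004, 1: 0.108}  # pi(theta_1 | x)
MARGINAL_X = {0: 0.75, 1: 0.25}       # m(x)
NO_TEST_LOSS = 30.0                   # optimal expected loss without the test


def stage_two(p_ill):
    """Return the action minimizing posterior expected loss, and that loss."""
    expected = {
        action: p_ill * LOSS[(action, "ill")] + (1.0 - p_ill) * LOSS[(action, "well")]
        for action in ("do not buy", "buy")
    }
    best = min(expected, key=expected.get)
    return best, expected[best]


def solve_tree():
    """Backward induction: solve stage 2 for each x, then average over m(x)."""
    plan = {x: stage_two(p) for x, p in POSTERIOR_ILL.items()}
    test_loss = sum(MARGINAL_X[x] * plan[x][1] for x in plan)
    first = "test" if test_loss < NO_TEST_LOSS else "do not test"
    return first, test_loss, {x: plan[x][0] for x in plan}


first_stage, test_loss, second_stage = solve_tree()
```

The function `stage_two` is the second-stage minimization for a given test result, and `solve_tree` nests it inside the first-stage comparison, mirroring the right-to-left pass through the tree.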

Lindley comments on the rationale behind solving a decision tree: 

Like a real tree, a decision tree contains parts that act together, or cohere.

Our method solves the easier problems that occur at the different parts
of the tree and then uses the rules of probability to make them cohere ...
it is coherence that is the principal novelty and the major tool in what
we do. ... The solution of the common subproblem, plus coherence fitting
together of the parts, produces the solution to the complex problem.

The subproblems are like bricks, the concrete provides the coherence.
(Lindley 1985, pp. 139, 146)

Figure 12.2 Solved decision tree for the two-stage travel insurance example.
Losses are given at the end of the branches. Above each chance node is the expected
loss, while above each decision node is the maximum expected utility. Alongside
the branches stemming from each chance node are the probabilities of the states of
nature. Actions that are not optimal are crossed out by a double line.

In our calculations so far we ignored the costs of testing. How would the solution
change if the cost of testing was, say, $10? All terminal branches that stem out from
the action “test” in Figure 12.2 would have the added cost of $10. This would imply
that the expected loss for the action “test” would be 25.5 while the expected loss for
the action “no test” would still be 30.0. Thus, the optimal decision would still be to
take the test. The difference between the expected loss of 30 if no test information
is available and that of 15.5 if the test is available, namely 14.5, is the largest price
you should be willing to pay for the test. In the context of this specific decision this
difference measures the worthiness of the information provided by the test, a theme
that will return in Chapter 13.
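
The effect of a testing cost can be explored with a short sketch that takes the two expected losses computed in this section as given. The function name and the alternative cost of 20 are hypothetical. Ties are resolved in favor of not testing, which is an arbitrary choice since the decision maker is indifferent at that point.

```python
def best_first_stage(test_cost, loss_with_test=15.5, loss_without_test=30.0):
    """Compare testing (with its cost added) against not testing."""
    total_with_test = loss_with_test + test_cost
    if total_with_test < loss_without_test:
        return "test", total_with_test
    return "no test", loss_without_test


# Largest price worth paying for the test in this example.
value_of_test = 30.0 - 15.5

decision_at_10 = best_first_stage(10.0)
decision_at_20 = best_first_stage(20.0)
```

Any cost below `value_of_test` leaves testing optimal, while any cost above it reverses the first-stage decision.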

Here we worked out the optimal sequential strategy in stages, by solving the
insurance problem first, and then nesting the solution into the diagnostic test problem.
Alternatively, we could have laid out all the possible sequential strategies. In
this case there are six:

Stage 1        Stage 2
Do not test    Do not buy the insurance
Do not test    Buy the insurance
Test           Do not buy the insurance
Test           Buy the insurance if x = 1. Otherwise, do not
Test           Buy the insurance if x = 0. Otherwise, do not
Test           Buy the insurance

We could then compute the expected loss of each of these six options, and
pick the one with the lowest value. It turns out that the answer would be the same.
This is an instance of the equivalence between the normal form of the analysis— 
which lists all possible decision rules as we just did—and the extensive form of 
analysis, represented by the tree. We elaborate on this equivalence in the next section. 
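
For this small example the normal form calculation is easy to carry out directly. The sketch below enumerates the six strategies and computes the expected loss of each, using the marginal and posterior probabilities quoted earlier in this section and ignoring the cost of testing. The encoding of strategies as functions of $x$ is a hypothetical choice made for the illustration.

```python
MARGINAL_X = {0: 0.75, 1: 0.25}        # m(x)
POSTERIOR_ILL = {0: 0.004, 1: 0.108}   # pi(theta_1 | x)


def loss(buy, ill):
    """50 if insured; otherwise 1000 if ill and 0 if well."""
    if buy:
        return 50.0
    return 1000.0 if ill else 0.0


# Each strategy maps a test result x to a purchase decision. Strategies that
# skip the test cannot depend on x, so their rules are constant.
STRATEGIES = {
    ("do not test", "do not buy"): lambda x: False,
    ("do not test", "buy"): lambda x: True,
    ("test", "do not buy"): lambda x: False,
    ("test", "buy if x = 1"): lambda x: x == 1,
    ("test", "buy if x = 0"): lambda x: x == 0,
    ("test", "buy"): lambda x: True,
}


def expected_loss(rule):
    """Average the loss over the joint distribution of x and the state."""
    total = 0.0
    for x, m in MARGINAL_X.items():
        p_ill = POSTERIOR_ILL[x]
        buy = rule(x)
        total += m * (p_ill * loss(buy, True) + (1.0 - p_ill) * loss(buy, False))
    return total


losses = {name: expected_loss(rule) for name, rule in STRATEGIES.items()}
best_strategy = min(losses, key=losses.get)
```

The minimizing strategy is to test and buy only after a positive result, with the same expected loss of 15.5 obtained by working backwards through the tree.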


12.3 Dynamic programming 

12.3.1 Two-stage finite decision problems 

A general two-stage finite decision problem can be represented by the decision tree 
of Figure 12.3. As in the travel insurance example, the tree represents the sequential 
problem with decision nodes shown in a chronological order. 

To formalize the solution we first introduce some additional notation. Actions
will carry a superscript in brackets indicating the stage at which they are available.
So $a_1^{(s)}, \ldots, a_{I_s}^{(s)}$ are the actions available at stage $s$. In Figure 12.3, $s$ is either 1 or 2.
For each action at stage 1 we have a set of possible observations that will potentially
guide the decision at stage 2. For action $a_{i_1}^{(1)}$ these are indicated by $x_{i_1 1}, \ldots, x_{i_1 J}$. For
each action at stage 2 we have the same set of possible states of the world $\theta_1, \ldots, \theta_K$,
and for each combination of actions and states of the world, as usual, an outcome $z$.
If actions $a_{i_1}^{(1)}$ and $a_{i_2}^{(2)}$ are chosen and $\theta_k$ is the true state of the world, the outcome
is $z_{i_1 i_2 k}$. A warning about notation: for the remainder of the book, we are going to
abandon the loss notation in favor of the utility notation used in the discussion of
foundations. Finally, in the above, we assume an equal number of possible states of
the world and outcomes to simplify notation. A more general formulation is given in
Section 12.3.2.

In our travel insurance example we used backwards induction in that we worked
from the outermost branches of the decision tree (the right side of Figure 12.3) back
to the root node on the left. We proceeded from the terminal branches to the root by
alternating the calculation of expected utilities at the random nodes and maximization
of expected utilities at the decision nodes. This can be formalized for a general
two-stage decision tree as follows.

Figure 12.3 A general two-stage decision tree with finite states of nature and data.

At the second stage, given that we chose action $a_{i_1}^{(1)}$ in the first stage, and that
outcome $x_{i_1 j}$ was observed, we choose a terminal action $a^{*(2)}$ to achieve

$$\max_{1 \le i_2 \le I_2} \sum_{k=1}^{K} \pi(\theta_k \mid x_{i_1 j}) u(z_{i_1 i_2 k}). \qquad (12.1)$$

At stage 2 there only remains uncertainty about the state of nature $\theta$. As usual, this
is addressed by taking the expected value of the utilities with respect to the posterior
distribution. We then choose the action that maximizes the expected utility.
Expression (12.1) depends on $i_1$ and $j$, so this maximization defines a function
$\delta^{*(2)}(a_{i_1}^{(1)}, x_{i_1 j})$ which provides us with a recipe for how to optimally proceed in every
possible scenario. An important difference with Chapter 7 is the dependence on the
first-stage decision, typically necessary because different actions may be available
for $\delta^{(2)}$ to choose from, depending on what was chosen earlier.

Equipped with $\delta^{*(2)}$, we then step back to stage 1 and choose an action $a^{*(1)}$ to
achieve

$$\max_{1 \le i_1 \le I_1} \sum_{j=1}^{J} \left[ \max_{1 \le i_2 \le I_2} \sum_{k=1}^{K} \pi(\theta_k \mid x_{i_1 j}) u(z_{i_1 i_2 k}) \right] m(x_{i_1 j}). \qquad (12.2)$$

The inner maximum is expression (12.1) and we have it from the previous step. The
outer summation in (12.2) computes the expected utility associated with choosing
the $i_1$th action at stage 1, and then proceeding optimally in stage 2. When the stage 1
decision is made, the maximum expected utilities ensuing from the available actions
are uncertain because the outcome of $x$ is not known. We address this by taking
the expected value of the maximum utilities calculated in (12.1) with respect to the
marginal distribution of $x$, for each action $a_{i_1}^{(1)}$. The result is the outer summation
in (12.2). The optimal solution at the first stage is then the action that maximizes the
outer summation. At the end of the whole process, an optimal sequential decision
rule is available in the form of a pair $(a^{*(1)}, \delta^{*(2)})$.
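
The two-step recipe translates into a short generic routine. The sketch below is one possible implementation for finite problems, under the assumption that the prior, the likelihood of each observation under each first-stage action, and the utilities are supplied as nested lists. The numerical inputs are hypothetical and only loosely patterned on the travel insurance example, with utilities taken as negative losses.

```python
def solve_two_stage(prior, likelihood, utility):
    """Backward induction for a finite two-stage decision problem.

    prior[k]             : pi(theta_k)
    likelihood[i1][j][k] : f(x_{i1 j} | theta_k) under first-stage action i1
    utility[i1][i2][k]   : u(z_{i1 i2 k})
    Returns (best i1, its expected utility, second-stage rule as a list over j).
    """
    best = None
    for i1, outcomes in enumerate(likelihood):
        expected_utility, rule = 0.0, []
        for lik_j in outcomes:
            joint = [p * f for p, f in zip(prior, lik_j)]
            m = sum(joint)  # marginal probability of this observation
            if m == 0.0:
                rule.append(None)  # impossible observation, nothing to plan
                continue
            posterior = [q / m for q in joint]
            # Inner maximization: posterior expected utility of each i2.
            values = [
                sum(p * u for p, u in zip(posterior, row)) for row in utility[i1]
            ]
            i2 = max(range(len(values)), key=values.__getitem__)
            rule.append(i2)
            expected_utility += m * values[i2]  # outer summation over j
        if best is None or expected_utility > best[1]:
            best = (i1, expected_utility, rule)
    return best


# Hypothetical inputs. States: ill, well. First stage: 0 = no test (a single
# uninformative outcome), 1 = test (outcomes x = 0 and x = 1).
prior = [0.03, 0.97]
likelihood = [
    [[1.0, 1.0]],
    [[0.1, 0.77], [0.9, 0.23]],
]
# Second stage: 0 = do not buy, 1 = buy.
payoffs = [[-1000.0, 0.0], [-50.0, -50.0]]
utility = [payoffs, payoffs]

best_action, best_value, stage_two_rule = solve_two_stage(prior, likelihood, utility)
```

The routine allows the number of observations to differ across first-stage actions, which is convenient for representing an action such as not testing, where nothing is observed.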

To map this procedure back to our discussion of optimal decision functions in
Chapter 7, imagine we are minimizing the negative of the function $u$ in expression
(12.2). The innermost maximization is the familiar minimization of posterior
expected loss. The optimal $a^{*(2)}$ depends on $x$. In the terminology of Chapter 7 this
optimization will define a formal Bayes rule with respect to the loss function given
by the negative of the utility. The summation with respect to $j$ computes an average
of the posterior expected losses with respect to the marginal distribution of the
observations, and so it effectively is a Bayes risk $r$, so long as the loss is bounded.
In the two-stage setting we get a Bayes risk for every stage 1 action $a_{i_1}^{(1)}$, and then
choose the action with the lowest Bayes risk. We now have a plan for what to do at
stage 1, and for what to do at stage 2 in response to any of the potential experimental
outcomes that result from the stage 1 decision.

In this multistage decision, the likelihood principle operates at stage 2 but not
at stage 1. At stage 2 the data are known and the alternative results that never
occurred have become irrelevant. At stage 1 the data are unknown, and the distribution
of possible experimental results is essential for making the experimental design
decision.

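Another important connection between this discussion and that of Chapter 7 concerns
the equivalence of normal and extensive forms of analysis. Using the Bayes
rule we can rewrite the outer summation in expression (12.2) as

$$\sum_{j=1}^{J} \left[ \max_{1 \le i_2 \le I_2} \sum_{k=1}^{K} \pi(\theta_k \mid x_{i_1 j})\, u(z_{i_1 i_2 k}) \right] m(x_{i_1 j}) = \sum_{j=1}^{J} \max_{1 \le i_2 \le I_2} \sum_{k=1}^{K} \pi(\theta_k) f(x_{i_1 j} \mid \theta_k)\, u(z_{i_1 i_2 k}).$$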


The function $\delta^*$ that is defined by maximizing the inner sum pointwise for each $i_1$ and
$j$ will also maximize the average expected utility, so we can rewrite the right hand
side as

$$\max_{\delta} \left[ \sum_{j=1}^{J} \sum_{k=1}^{K} \pi(\theta_k) f(x_{i_1 j} \mid \theta_k)\, u(z_{i_1 i_2 k}) \right].$$

Here $\delta$ is not explicitly represented in the expression in square brackets. Each choice
of $\delta$ specifies how subscript $i_2$ is determined as a function of $i_1$ and $j$. Reversing the
order of summations we get

$$\max_{\delta} \left[ \sum_{k=1}^{K} \pi(\theta_k) \left[ \sum_{j=1}^{J} f(x_{i_1 j} \mid \theta_k)\, u(z_{i_1 i_2 k}) \right] \right].$$


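Lastly, inserting this into expression (12.2) we obtain an equivalent representation as

$$\max_{i_1} \max_{\delta} \left[ \sum_{k=1}^{K} \pi(\theta_k) \left[ \sum_{j=1}^{J} f(x_{i_1 j} \mid \theta_k)\, u(z_{i_1 i_2 k}) \right] \right]. \tag{12.3}$$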


This equation gives the solution in normal form: rather than alternating expectations
and maximizations, the two stages are maximized jointly with respect to pairs
$(a^{(1)}, \delta^{(2)})$. The inner summation gives the expected utility of a decision rule, given the
state of nature, over repeated experiments. The outer summation gives the expected
utility of a decision rule averaging over the states of nature. A similar equivalence
was brought up earlier on in Chapter 7 when we discussed the relationship between
the posterior expected loss (in the extensive form) and the Bayes risk (in the normal
form of analysis).
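
The equivalence can also be checked numerically on a small case. The sketch below, for a fixed stage 1 action, compares the extensive form (maximize inside the sum over outcomes) with the normal form (enumerate every rule $\delta$ and maximize the prior-averaged expected utility). The function names and the numbers are illustrative assumptions of ours, not taken from the text.

```python
from fractions import Fraction as F
from itertools import product

def extensive_value(prior, lik_i1, util_i1):
    """Right-hand side of the Bayes rule rewrite: maximize separately for each outcome j."""
    total = 0
    for f_j in lik_i1:
        total += max(
            sum(p * f * u for p, f, u in zip(prior, f_j, u_i2))
            for u_i2 in util_i1
        )
    return total

def normal_value(prior, lik_i1, util_i1):
    """Inner part of (12.3): maximize over whole rules delta, each mapping j to i2."""
    J, I2, K = len(lik_i1), len(util_i1), len(prior)
    best = None
    for delta in product(range(I2), repeat=J):
        # Expected utility of delta given theta_k, then averaged over the prior.
        value = sum(
            prior[k] * sum(lik_i1[j][k] * util_i1[delta[j]][k] for j in range(J))
            for k in range(K)
        )
        if best is None or value > best:
            best = value
    return best

# Hypothetical numbers: two states, three outcomes, three terminal actions.
prior = [F(3, 10), F(7, 10)]
lik_i1 = [[F(1, 2), F(1, 10)], [F(3, 10), F(3, 10)], [F(1, 5), F(3, 5)]]
util_i1 = [[F(1), F(0)], [F(0), F(1)], [F(2, 5), F(2, 5)]]
ext = extensive_value(prior, lik_i1, util_i1)
nor = normal_value(prior, lik_i1, util_i1)
print(ext, nor)
```

Both computations return the same value here, as the argument above says they should. The normal form enumerates $I_2^J$ rules, which is one practical reason the extensive form is usually preferred for computation.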


12.3.2 More than two stages 

We are now going to extend these concepts to more general multistage problems, and 
consider the decision tree of Figure 12.4. As before, stages are chronological from 
left to right, information may become available between stages, and each decision is 
dependent on decisions made in earlier stages. We will assume that the decision tree 
is bounded, in the sense that there is a finite number of both stages and decisions at 
each stage. 

First we need to set the notation. As before, $S$ is the number of decision stages.
Now $a^{(s)}_0, \ldots, a^{(s)}_{I_s}$ are the decisions available at the $s$th stage, with $s = 1, \ldots, S$. At
each stage, the action $a^{(s)}_0$ is different from the rest in that, if $a^{(s)}_0$ is taken, no further
stages take place, and the decision problem terminates. Formally $a^{(s)}_0$ maps states to
outcomes in the standard way. For each stopping action we have a set of relevant
states of the world that constitute the domain of the action. At stage $s$, the possible
states are $\theta^{(s)}_{01}, \ldots, \theta^{(s)}_{0K_s}$. For each continuation action $a^{(s)}_i$, $i > 0$, $1 \le s < S$, we
observe a random variable $x^{(s)}_i$ with possible values $x^{(s)}_{i1}, \ldots, x^{(s)}_{iJ_s}$. If stage $s$ is reached,
the decision may depend on all the decisions that were made, and all the information
that was accrued in preceding stages. We concisely refer to this information as the
history, and use the notation $\mathcal{H}_{s-1}$ where

$$\mathcal{H}_{s-1} = \left\{ a^{(1)}_{i_1}, \ldots, a^{(s-1)}_{i_{s-1}}, x^{(1)}_{i_1 j_1}, \ldots, x^{(s-1)}_{i_{s-1} j_{s-1}} \right\}, \qquad s = 2, \ldots, S.$$

For completeness of notation, the (empty) history prior to stage 1 is denoted by $\mathcal{H}_0$.
Finally, at the last stage $S$, the set of states of the world constituting the domain of
action $a^{(S)}_{i_S}$ is $\theta^{(S)}_{i_S 1}, \ldots, \theta^{(S)}_{i_S K_S}$.



Figure 12.4 Decision tree for a generic multistage decision problem. 


It is straightforward to extend this setting to the case in which $J_s$ also depends on
$i$ without any conceptual modification. We avoid it here for notational simplicity.

Dynamic programming proceeds as follows. We start by solving the decision
problem from the last stage, stage $S$, by maximizing expected utility for every possible
history. Then, conditional on the optimal choice made at stage $S$, we solve
the problem at stage $S - 1$, by maximizing again the expected maximum utility.
This procedure is repeated until we solve the decision problem at the first stage.
Algorithmically, we can describe this recursive procedure as follows:

I. At stage $S$:

(i) For every possible history $\mathcal{H}_{S-1}$, compute the expected utility of the
actions $a^{(S)}_i$, $i = 0, \ldots, I_S$, available at stage $S$, using

$$U^{S}(a^{(S)}_i) = \sum_{k=1}^{K_S} u^{(S)}_{ik}(\mathcal{H}_{S-1})\, \pi(\theta^{(S)}_{ik} \mid \mathcal{H}_{S-1}) \tag{12.4}$$

where $u^{(S)}_{ik}(\mathcal{H}_{S-1}) = u(z_{i_1 \ldots i_{S-1} i k})$.

(ii) Obtain the optimal action, that is

$$a^{*(S)}(\mathcal{H}_{S-1}) = \arg\max_{i} U^{S}(a^{(S)}_i). \tag{12.5}$$

This is a function of $\mathcal{H}_{S-1}$ because both the posterior distribution of $\theta$
and the utility of the outcomes depend on the past history of decisions
and observations.

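II. For stages $S - 1$ through 1, repeat:

(i) For every possible history $\mathcal{H}_{s-1}$, compute the expected utility of actions
$a^{(s)}_i$, $i = 0, \ldots, I_s$, available at stage $s$, using

$$U^{s}(a^{(s)}_i) = \sum_{j=1}^{J_s} u^{(s)}_{ij}(\mathcal{H}_{s-1})\, m(x^{(s)}_{ij} \mid \mathcal{H}_{s-1}), \qquad i > 0 \tag{12.6}$$

$$U^{s}(a^{(s)}_0) = \sum_{k=1}^{K_s} u^{(s)}_{0k}(\mathcal{H}_{s-1})\, \pi(\theta^{(s)}_{0k} \mid \mathcal{H}_{s-1})$$

where now $u^{(s)}_{ij}$ are the expected utilities associated with the optimal continuation
from stage $s + 1$ on, given that $a^{(s)}_i$ is chosen and $x^{(s)}_{ij}$ occurs.
If we indicate by $\{\mathcal{H}_{s-1}, a^{(s)}_i, x^{(s)}_{ij}\}$ the resulting history, then we have

$$u^{(s)}_{ij}(\mathcal{H}_{s-1}) = U^{s+1}\left(a^{*(s+1)}(\{\mathcal{H}_{s-1}, a^{(s)}_i, x^{(s)}_{ij}\})\right), \qquad i > 0. \tag{12.7}$$

A special case is the utility of the decision to stop, which is

$$u^{(s)}_{0k}(\mathcal{H}_{s-1}) = u(z_{i_1 \ldots i_{s-1} 0 k}).$$

(ii) Obtain the optimal action, that is

$$a^{*(s)}(\mathcal{H}_{s-1}) = \arg\max_{i} U^{s}(a^{(s)}_i). \tag{12.8}$$

(iii) Move to stage $s - 1$, or stop if $s = 1$.

This algorithm identifies a sequence of functions $a^{*(s)}(\mathcal{H}_{s-1})$, $s = 1, \ldots, S$, that defines
the optimal sequential solution.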
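
The recursion can be sketched generically as folding back a tree. In the sketch below, a tree is a nested tuple whose node types ("leaf", "chance", "decision") and layout are our own illustrative encoding, not notation from the text; probabilities at chance nodes are assumed to be already conditional on the history leading to them.

```python
from fractions import Fraction as F

def fold_back(node):
    """Return (expected utility, plan) for a decision tree given as nested tuples.

    ("leaf", utility)
    ("chance", [(probability, child), ...])
    ("decision", {action_label: child, ...})
    """
    kind = node[0]
    if kind == "leaf":
        return node[1], None
    if kind == "chance":
        folded = [(p, fold_back(child)) for p, child in node[1]]
        value = sum(p * v for p, (v, _) in folded)       # take expectations
        return value, [plan for _, (_, plan) in folded]  # one sub-plan per outcome
    if kind == "decision":
        folded = {a: fold_back(child) for a, child in node[1].items()}
        best = max(folded, key=lambda a: folded[a][0])   # maximize
        return folded[best][0], (best, folded[best][1])
    raise ValueError(f"unknown node type: {kind!r}")

def guess(p_right, cost):
    """Hypothetical terminal gamble: utility 1 if right, 0 if wrong, minus a cost."""
    return ("chance", [(p_right, ("leaf", 1 - cost)), (1 - p_right, ("leaf", -cost))])

# Hypothetical tree: stop now, or pay 1/10 to observe x and then choose a or b.
tree = ("decision", {
    "stop": guess(F(1, 2), 0),
    "continue": ("chance", [
        (F(1, 2), ("decision", {"a": guess(F(4, 5), F(1, 10)), "b": guess(F(1, 5), F(1, 10))})),
        (F(1, 2), ("decision", {"a": guess(F(1, 5), F(1, 10)), "b": guess(F(4, 5), F(1, 10))})),
    ]),
})
value, plan = fold_back(tree)
print(value, plan[0], [sub[0] for sub in plan[1]])
```

The returned plan records the optimal action at the root and, for each possible observation, the optimal action at the next decision node, which plays the role of the functions $a^{*(s)}(\mathcal{H}_{s-1})$.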

In a large class of problems we only need to allow $z$ to depend on the unknown $\theta$
and the terminal action. This still covers problems in which the first $S - 1$ stages are
information gathering and free, while the last stage is a statistical decision involving
$\theta$. See also Problem 12.4. We now illustrate this general algorithm in two artificial
examples, and then move to applications.


12.4 Trading off immediate gains and information 

12.4.1 The secretary problem 

For an application of the dynamic programming technique we will consider a version
of a famous problem usually referred to in the literature under a variety of politically
incorrect denominations such as the “secretary problem” or the “beauty-contest problem.”
Lindley (1961), DeGroot (1970), and Bernardo and Smith (1994) all discuss
this example, and there is a large literature including versions that can get quite
complicated. The reason for all this interest is that this example, while relatively
simple, captures one of the most fascinating sides of sequential analysis: that is,
the ability of optimal solutions to negotiate trade-offs between immediate gains and
information-gathering activities that may lead to greater future gains.

We decided to make up yet another story for this example. After successful interviews
in $S$ different companies you are the top candidate in each one of them. One
by one, and sequentially, you are going to receive an offer from each company. Once
you receive an offer, you may accept it or you may decline it and wait for the next
offer. You cannot consider more than one offer at a time, and if you decide to
decline an offer, you cannot go back and accept it later. You have no information
about whether the offers that you have not seen are better or worse than the ones you
have seen: the offers come in random order. The information you do have about each
offer is its relative rank among the offers you previously received. This is reasonable
if the companies are more or less equally desirable to you, but the conditions of the
offers vary. If you refuse all previous offers, and you reach the $S$th stage, you take
the last offer.

When should you accept an offer? How early should you do it and how high 
should the rank be? If you accept too soon, then it is possible that you will miss 
the opportunity to take better future offers. On the other hand, if you wait too long, 
you may decline offers that turn out to be better than the one you ultimately accept. 
Waiting gives you more information, but fewer opportunities to use it. 

Figure 12.5 has a decision tree representation of two consecutive stages of the
problem. At stage $s$, you examined $s$ offers. You can choose decision $a^{(s)}_1$, to wait for
the next offer, or $a^{(s)}_0$, to accept the current offer and stop the process. If you stop at $s$,
your utility will depend on the unknown rank $\theta^{(s)}$ of offer $s$: $\theta^{(s)}$ is a number between
1 and $S$. At $s$ you make a decision based on the observed rank $x^{(s)}$ of offer $s$ among the
$s$ already received. This is a number between 1 and $s$. In summary, $\theta^{(s)}$ and $x^{(s)}$ refer
to the same offer. The first is the unknown rank if the offer is accepted. The second
is the known rank being used to make the decision. If you continue until stage $S$, the
true relative rank of each offer will become known also.

Figure 12.5 Decision tree representing two stages of the sequential problem of
Section 12.4.1. At any stage $s$, if action $a^{(s)}_0$ is chosen, the unknown rank $\theta^{(s)}$ takes
a value between 1 and $S$. If action $a^{(s)}_1$ is chosen, a relative rank between 1 and $s$ is
observed.

The last assumption we need is a utility function. We take $u(\theta)$ to be the utility
of selecting the job that has rank $\theta$ among all $S$ options, and assume $u(1) \ge
u(2) \ge \cdots \ge u(S)$. This utility will be the same irrespective of the stage at which the
decision is reached.

We are now able to evaluate the expected utility of both accepting the offer (stopping)
and declining it (continuing) at stage $s$. If we are at stage $s$ we must have
rejected all previous offers. Also the relative ranks of the older offers are now irrelevant.
Therefore the only part of the history $\mathcal{H}_s$ that matters is that the $s$th offer has
rank $x^{(s)}$.

Stopping. Consider the probability $\pi(\theta^{(s)} \mid x^{(s)})$ that the $s$th offer, with observed
rank $x^{(s)}$, has, in fact, rank $\theta^{(s)}$ among all $S$ offers. As we do not have any a priori
information about the offers that have not yet been made, we can evaluate $\pi(\theta^{(s)} \mid x^{(s)})$
as the probability that, in a random sample of $s$ companies taken from a population
with $S$ companies, $x^{(s)} - 1$ are from the highest-ranking $\theta^{(s)} - 1$ companies, one has
rank $\theta^{(s)}$, and the remaining $s - x^{(s)}$ are from the lowest-ranking $S - \theta^{(s)}$ companies.
Thus, we have

$$\pi(\theta^{(s)} \mid x^{(s)}) = \binom{\theta^{(s)} - 1}{x^{(s)} - 1} \binom{S - \theta^{(s)}}{s - x^{(s)}} \bigg/ \binom{S}{s}, \qquad \text{for } x^{(s)} \le \theta^{(s)} \le S - s + x^{(s)}. \tag{12.9}$$

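Let $U(a^{(s)}_0)$ denote the expected utility of decision $a^{(s)}_0$; that is, of accepting the $s$th
offer. Equation (12.9) implies that

$$U(a^{(s)}_0) = \sum_{\theta = x^{(s)}}^{S - s + x^{(s)}} u(\theta) \binom{\theta - 1}{x^{(s)} - 1} \binom{S - \theta}{s - x^{(s)}} \bigg/ \binom{S}{s}, \qquad s = 1, \ldots, S. \tag{12.10}$$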
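
Equations (12.9) and (12.10) are easy to evaluate exactly. The following is a small sketch; the function names `rank_posterior` and `stop_utility` are our own and not part of the text.

```python
from fractions import Fraction
from math import comb

def rank_posterior(S, s, x):
    """pi(theta | x) from (12.9): overall rank of the s-th offer given relative rank x."""
    total = comb(S, s)
    return {
        theta: Fraction(comb(theta - 1, x - 1) * comb(S - theta, s - x), total)
        for theta in range(x, S - s + x + 1)
    }

def stop_utility(S, s, x, u):
    """U(a_0^(s)) from (12.10), for a utility function u(theta) on overall ranks."""
    return sum(u(theta) * p for theta, p in rank_posterior(S, s, x).items())

# With S = 5 offers, the third offer is the best of the first three (x = 1).
post = rank_posterior(5, 3, 1)
print(post)
print(stop_utility(5, 3, 1, lambda theta: 5 - theta))  # linear utility S - theta
```

For $S = 5$, $s = 3$ and $x^{(s)} = 1$ the possible overall ranks are 1, 2 and 3, with probabilities 6/10, 3/10 and 1/10, which sum to one as they must.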


Waiting. On the other hand, we need to consider the expected utility of taking
decision $a^{(s)}_1$ and continuing optimally after $s$. Say again the $s$th offer has rank $x^{(s)}$,
and define $U^*(a^{(s)}_1) = b(s, x^{(s)})$. If you decide to wait for the next offer, the probability
that the next offer will have an observed rank $x^{(s+1)}$, given that the $s$th offer has rank
$x^{(s)}$, is $1/(s + 1)$, as the offers arrive at random and all the $s + 1$ values of $x^{(s+1)}$
are equally probable. Thus, the expected utility of waiting for the next offer and
continuing optimally is

$$U^*(a^{(s)}_1) = b(s, x^{(s)}) = \frac{1}{s + 1} \sum_{x=1}^{s+1} b(s + 1, x). \tag{12.11}$$

Also, at stage $S$ we must stop, so the following relation must be satisfied:

$$b(S, x^{(S)}) = U^*(a^{(S)}_1) = U(a^{(S)}_0), \qquad x^{(S)} = 1, \ldots, S. \tag{12.12}$$


Because we know the right hand side from equation (12.10), we can use equation (12.11)
to recursively step back and determine the expected utilities of continuing.
The optimal solution at stage $s$ is, depending on $x^{(s)}$, to wait for the next offer
if $U^*(a^{(s)}_1) > U(a^{(s)}_0)$ or accept the current offer if $U^*(a^{(s)}_1) \le U(a^{(s)}_0)$.
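
A sketch of this backward recursion for an arbitrary utility function follows. It reads the terms in the sum of (12.11) as the values of proceeding optimally from stage $s + 1$ once the relative rank there is observed; that reading, the function name `solve_offers`, and the dictionary layout of the results are our own assumptions.

```python
from fractions import Fraction
from math import comb

def solve_offers(S, u):
    """Backward induction for the sequential offer problem with utility u(rank)."""
    def stop(s, x):  # equation (12.10)
        return sum(
            u(t) * Fraction(comb(t - 1, x - 1) * comb(S - t, s - x), comb(S, s))
            for t in range(x, S - s + x + 1)
        )

    value = {S: {x: stop(S, x) for x in range(1, S + 1)}}  # must stop at stage S
    wait = {}
    accept = {S: set(range(1, S + 1))}
    for s in range(S - 1, 0, -1):
        # The next relative rank is uniform on 1..s+1, whatever x is now.
        wait[s] = sum(value[s + 1].values()) / (s + 1)
        value[s], accept[s] = {}, set()
        for x in range(1, s + 1):
            stop_now = stop(s, x)
            if stop_now >= wait[s]:  # accept on ties
                accept[s].add(x)
            value[s][x] = max(stop_now, wait[s])
    return value, wait, accept

# Example: S = 4 offers and only the best one counts.
value, wait, accept = solve_offers(4, lambda rank: 1 if rank == 1 else 0)
print(accept)       # relative ranks at which to accept, by stage
print(value[1][1])  # expected utility of the optimal policy
```

With four offers and a utility that rewards only the best one, the sketch declines the first offer, accepts from stage 2 on as soon as an offer is the best so far, and attains expected utility 11/24.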

A simple utility specification that allows for a more explicit solution is one where
your only goal is to get the best offer, with all other ranks being equally disliked.
Formally, $u(1) = 1$ and $u(\theta) = 0$ for $\theta > 1$. It then follows from equation (12.10)
that, for any $s$,

$$U(a^{(s)}_0) = \begin{cases} s/S & \text{if } x^{(s)} = 1 \\ 0 & \text{if } x^{(s)} > 1. \end{cases} \tag{12.13}$$

This implies that $U^*(a^{(s)}_1) > U(a^{(s)}_0)$ whenever $x^{(s)} > 1$. Therefore, you should wait
for the next offer if the rank of the $s$th offer is not the best so far. So far this is a bit
obvious: if your current offer is not the best so far, it cannot be the best overall, and
you might as well continue. But what should you do if the current offer has observed
rank 1?

The largest expected utility achievable at stage $s$ is the largest of the expected utilities
of the two possible decisions. Writing the expected utility of continuing using
equation (12.11), we have

$$U^*(a^{(s)}) = \max\left( U(a^{(s)}_0),\ \frac{1}{s + 1} \sum_{x=1}^{s+1} b(s + 1, x) \right). \tag{12.14}$$

Let $v(s) = (1/(s + 1)) \sum_{x=1}^{s+1} b(s + 1, x)$. One can show (see Problem 12.5) that

$$v(s) = \frac{1}{s + 1}\, U^*(a^{(s+1)}) + \frac{s}{s + 1}\, v(s + 1), \tag{12.15}$$

with $U^*(a^{(s+1)})$ evaluated at $x^{(s+1)} = 1$.
At the last stage, from equations (12.12) and (12.14) we obtain that $U^*(a^{(S)}) = 1$ and
$v(S) = 0$. By backward induction on $s$ it can be verified that (see Problem 12.5)

$$v(s) = \frac{s}{S} \left( \frac{1}{S - 1} + \frac{1}{S - 2} + \cdots + \frac{1}{s} \right). \tag{12.16}$$

Therefore, from equations (12.14) and (12.16) at stage $s$, with $x^{(s)} = 1$,

$$U^*(a^{(s)}) = \max\left( \frac{s}{S},\ \frac{s}{S} \left( \frac{1}{S - 1} + \frac{1}{S - 2} + \cdots + \frac{1}{s} \right) \right). \tag{12.17}$$

Let $s^*$ be the smallest positive integer so that $1/(S - 1) + \cdots + 1/s^* \le 1$. The
optimal procedure is to wait until $s^*$ offers are made. If the $s^*$th offer is the best so
far, accept it. Otherwise, wait until you reach the best offer so far and accept it. If
you reach stage $S$ you have to accept the offer irrespective of the rank. If $S$ is large,
$s^* \approx S/e$. So you should first just wait and observe approximately $1/e \approx 37\%$ of
the offers. At that point the information-gathering stage ends and you accept the first
offer that ranks above all previous offers.
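
The threshold and its limiting behavior can be checked directly. In the sketch below the function names are ours, and the success probability uses the closed form (12.16) evaluated at $s^* - 1$, which is our own way of summarizing the policy's performance.

```python
from fractions import Fraction
from math import e

def threshold(S):
    """Smallest s with 1/(S-1) + ... + 1/s <= 1 (the sum is empty when s = S)."""
    tail = Fraction(0)
    s = S
    # The sum grows as s decreases, so scan downward while it stays within 1.
    while s > 1 and tail + Fraction(1, s - 1) <= 1:
        s -= 1
        tail += Fraction(1, s)
    return s

def success_probability(S):
    """Probability of ending with the best offer under the threshold rule."""
    s_star = threshold(S)
    if s_star == 1:
        return Fraction(1, S)  # the first offer is accepted outright
    r = s_star - 1             # offers passed over before any acceptance
    return Fraction(r, S) * sum(Fraction(1, k) for k in range(r, S))

for S in (4, 10, 100, 1000):
    print(S, threshold(S), float(success_probability(S)))
```

For $S = 4$ this gives $s^* = 2$ and success probability 11/24, and as $S$ grows both $s^*/S$ and the success probability approach $1/e$.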

Lindley (1961) provides a solution to this problem when the utility is linear in
the rank, that is when $u(\theta) = S - \theta$.

12.4.2 The prophet inequality 

This is an example discussed in Bather (2000). It has points in common with 
the one in the previous section. What it adds is a simple example of a so-called 
“prophet inequality”—a bound on the expected utility that can be expected if 
information that accrues sequentially was instead revealed in its entirety at the 
beginning. 

You have the choice of one out of a set of $S$ options. The values of the options
are represented by nonnegative and independent random quantities $\theta^{(1)}, \ldots, \theta^{(S)}$ with
known probability distributions. Say you decided to invest a fixed amount of money
for a year, and you are trying to choose among investment options with random
returns. Let $\mu_s$ denote the expected return of option $s$, that is $\mu_s = E[\theta^{(s)}]$, $s =
1, \ldots, S$. Assume that these means are finite, and assume that returns are in the utility
scale.

Let us first consider two extreme situations. If you are asked to choose a single
option now, your expected utility is maximized by choosing the option with largest
mean, and that maximum is $g_S = \max(\mu_1, \ldots, \mu_S)$. At the other extreme, if you are
allowed to communicate with a prophet that knows all the true returns, then your
expected utility is $h_S = E[\max(\theta^{(1)}, \ldots, \theta^{(S)})]$. Now, consider the sequential case in
which you can ask the prophet about one option at a time. Suppose that the values
of $\theta^{(1)}, \ldots, \theta^{(S)}$ are revealed sequentially in this order. At each stage, you are only
allowed to ask about the next option if you reject the ones examined so far. Cruel perhaps,
but it makes for another interesting trade-off between learning and immediate
gains.

At stage $s$, decisions can be $a^{(s)}_1$, wait for the next offer, or $a^{(s)}_0$, accept the offer and
stop bothering the prophet. If you stop, your utility will be the value $\theta^{(s)}$ of the option
you decided to accept. On the other hand, the expected utility of continuing is $U^{(s)}(a^{(s)}_1)$.
At the last stage, when there is only one option to be revealed, the expectation of the
best utility that can be reached is $U^{*(S)}(a^{(S)}_0) = \mu_S$, because $a^{(S)}_0$ is the only decision
available. So when folding back to stage $S - 1$, you either accept option $\theta^{(S-1)}$ or wait
to see $\theta^{(S)}$. That is, the maximum expected utility is $U^{*(S-1)} = E[\max(\theta^{(S-1)}, \mu_S)] =
E[\max(\theta^{(S-1)}, U^{*(S)})]$. With backward induction, we find that the expected maximum
utility at stage $s$ is

$$U^{*(s)} = E[\max(\theta^{(s)}, U^{*(s+1)})], \qquad \text{for } s = 1, \ldots, S - 1. \tag{12.18}$$

To simplify notation, let $w_s$ denote the expected maximum utility when there are
$s$ options to consider, that is $w_s = U^{*(S-s+1)}$, $s = 1, \ldots, S$. What is the advantage
of obtaining full information of the options over the sequential procedure? We will
prove the prophet’s inequality

$$h_S \le 2 w_S. \tag{12.19}$$

This inequality is explained by Bather as follows:

[I]t shows that a sequential decision maker can always expect to gain at
least half as much as a prophet who is able to foresee the future, and
make a choice with full information on all the values available. (Bather
2000, p. 111)

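Proof: To prove the inequality, first observe that $w_1 = \mu_S$. Then, since $w_2 =
E[\max(\theta^{(S-1)}, w_1)]$, it follows that $w_2 \ge E[\theta^{(S-1)}] = \mu_{S-1}$ and $w_2 \ge w_1$. Recursively,
we can prove that the sequence $w_s$ is increasing, that is

$$w_1 \le w_2 \le \cdots \le w_S. \tag{12.20}$$

For the remainder of the proof, relabel the options in reverse order of examination, so
that $\theta^{(r)}$ denotes the option revealed when $r$ options remain. Then $w_r =
E[\max(\theta^{(r)}, w_{r-1})]$, and $h_S$ is unchanged because the maximum over all options does not
depend on the labeling. Define $y_1 = \theta^{(1)}$ and $y_r = \max(\theta^{(r)}, w_{r-1})$ for $r = 2, \ldots, S$. Moreover, define $w_0 = 0$
and $z_r = \max(\theta^{(r)} - w_{r-1}, 0)$. This implies that $y_r = w_{r-1} + z_r$. From (12.20), $w_{r-1} \le
w_{S-1}$ for $r = 1, \ldots, S$. Thus,

$$y_r \le w_{S-1} + z_r, \qquad \text{for } r = 1, \ldots, S. \tag{12.21}$$

For any $r$, it follows from our definitions that $\theta^{(r)} \le y_r$. Thus, using equation (12.21),

$$\max(\theta^{(1)}, \ldots, \theta^{(S)}) \le \max(y_1, \ldots, y_S) \le w_{S-1} + \max(z_1, \ldots, z_S). \tag{12.22}$$

By definition, $z_r \ge 0$. Thus, $\max(z_1, \ldots, z_S) \le z_1 + \cdots + z_S$. By combining the latter
inequality with that in (12.22), and taking expectations, we obtain

$$\begin{aligned}
E[\max(\theta^{(1)}, \ldots, \theta^{(S)})] &\le w_{S-1} + E[z_1 + \cdots + z_S] \\
h_S &\le w_{S-1} + \sum_{r=1}^{S} E[z_r] = w_{S-1} + \sum_{r=1}^{S} E[y_r - w_{r-1}] \\
h_S &\le w_{S-1} + \sum_{r=1}^{S} (w_r - w_{r-1}) \\
h_S &\le w_{S-1} + w_S \le 2 w_S,
\end{aligned} \tag{12.23}$$

which completes the proof.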

which completes the proof. 


□ 
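
The inequality can be illustrated numerically for discrete distributions. In this sketch each option is a dictionary mapping values to probabilities; that representation, the function names, and the example are our own assumptions.

```python
from fractions import Fraction
from itertools import product

def expected_max_with(dist, c):
    """E[max(theta, c)] for a discrete distribution given as {value: probability}."""
    return sum(p * max(v, c) for v, p in dist.items())

def sequential_value(dists):
    """Fold back with (12.18), from the last option to the first; returns w_S."""
    value = sum(p * v for v, p in dists[-1].items())  # mu_S at the last stage
    for dist in reversed(dists[:-1]):
        value = expected_max_with(dist, value)
    return value

def prophet_value(dists):
    """h_S = E[max of all options], by enumerating the joint support."""
    total = 0
    for combo in product(*(d.items() for d in dists)):
        prob = 1
        for _, p in combo:
            prob *= p
        total += prob * max(v for v, _ in combo)
    return total

# Hypothetical example: a sure return of 1, then a long shot paying 10 w.p. 1/10.
dists = [
    {Fraction(1): Fraction(1)},
    {Fraction(10): Fraction(1, 10), Fraction(0): Fraction(9, 10)},
]
w = sequential_value(dists)
h = prophet_value(dists)
print(w, h, h <= 2 * w)
```

Here the sequential decision maker expects 1 while the prophet expects 19/10. Making the long shot rarer and larger, with the same mean, pushes the ratio toward 2, so the bound cannot be improved in general.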


12.5 Sequential clinical trials

Our next set of illustrations, while still highly stylized, brings us far closer to real
applications, particularly in the area of clinical trials. Clinical trials are scientific
experiments on real patients, and involve the conflicting goals of learning about
treatments, so that a broad range of patients may benefit from medical advances,
and ensuring a good outcome for the patients enrolled in the trial. For a more
general discussion on Bayesian sequential analysis of clinical trials see, for example,
Berry (1987) and Berger and Berry (1988). While we will not examine this
controversy in detail here, in the sequential setting the Bayesian and frequentist
approaches can differ in important ways. Lewis and Berry (1994) discuss the value of
sequential decision-theoretic approaches over conventional approaches. More specifically,
they compare performances of Bayesian decision-theoretic designs to classical
group-sequential designs of Pocock (1977) and O’Brien and Fleming (1979). This
comparison is within the hypothesis testing framework in which the parameters
associated with the utility function are chosen to yield classical type I and type II
error rates comparable to those derived under the Pocock and O’Brien and Fleming
designs. While the power functions of the Bayesian sequential decision-theoretic
designs are similar to those of the classical designs, they usually have smaller mean
sample sizes.


12.5.1 Two-armed bandit problems 

A decision maker can take a fixed number $n$ of observations sequentially. At each
stage, he or she can choose to observe either a random variable $x$ with density $f_x(\cdot \mid \theta_0)$
or a random variable $y$ with density $f_y(\cdot \mid \theta_1)$. The decision problem is to find a sequential
procedure that maximizes the expected value of the sum of the observations. This is an
example of a two-armed bandit problem discussed in DeGroot (1970) and Berry and
Fristedt (1985). For a clinical trial connection, imagine the two arms being two treatments
for a particular illness, and the random variables being measures of well-being
of the patients treated. The goal is to maximize the patient’s well-being, but some
early experimentation is necessary on both treatments to establish how to best do so.

We can formalize the solution to this problem using dynamic programming. At
stages $s = 0, \ldots, n$, the decision is either $a^{(s)}_0$, observe $x$, or $a^{(s)}_1$, observe $y$. Let
$\pi$ denote the joint prior distribution for $\theta = (\theta_0, \theta_1)$ and consider the maximum
expected sum of the $n$ observations given by

$$V^{(n)}(\pi) = \max\{ U(a^{(n)}_0), U(a^{(n)}_1) \} \tag{12.24}$$

where $U(a^{(n)}_i)$, $i = 0, 1$, is calculated under distribution $\pi$.

Suppose the first observation is taken from $x$. The joint posterior distribution
of $(\theta_0, \theta_1)$ is $\pi_x$ and the expected sum of the remaining $(n - 1)$ observations
is given by $V^{(n-1)}(\pi_x)$. Then, the expected sum of all $n$ observations is
$U(a^{(n)}_0) = E_x[x + V^{(n-1)}(\pi_x)]$. Similarly, if the first observation comes from $y$, the
expected sum of all $n$ observations is $U(a^{(n)}_1) = E_y[y + V^{(n-1)}(\pi_y)]$. Thus, the optimal
procedure $a^{*(n)}$ has expected utility

$$V^{(n)}(\pi) = \max\left\{ E_x[x + V^{(n-1)}(\pi_x)],\ E_y[y + V^{(n-1)}(\pi_y)] \right\}. \tag{12.25}$$

With the initial condition that $V^{(0)}(\pi) \equiv 0$, one can solve $V^{(s)}(\pi)$, $s = 1, \ldots, n$, and
find the sequential optimal procedure by induction.
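
To make recursion (12.25) concrete, the sketch below specializes it to a case the text does not single out: both arms are Bernoulli, with independent Beta priors that have integer parameters, so the posterior $\pi_x$ is summarized by four counts. These modeling choices and the function name `bandit_value` are assumptions made only for illustration.

```python
from fractions import Fraction
from functools import lru_cache

def bandit_value(n, prior0=(1, 1), prior1=(1, 1)):
    """V^(n)(pi) of (12.25) for two Bernoulli arms with independent Beta priors."""
    @lru_cache(maxsize=None)
    def V(m, a0, b0, a1, b1):
        if m == 0:
            return Fraction(0)      # initial condition V^(0) = 0
        p0 = Fraction(a0, a0 + b0)  # predictive probability that x = 1
        p1 = Fraction(a1, a1 + b1)  # predictive probability that y = 1
        # E_x[x + V^(m-1)(pi_x)]: update the Beta counts of arm 0.
        pull0 = (p0 * (1 + V(m - 1, a0 + 1, b0, a1, b1))
                 + (1 - p0) * V(m - 1, a0, b0 + 1, a1, b1))
        # E_y[y + V^(m-1)(pi_y)]: update the Beta counts of arm 1.
        pull1 = (p1 * (1 + V(m - 1, a0, b0, a1 + 1, b1))
                 + (1 - p1) * V(m - 1, a0, b0, a1, b1 + 1))
        return max(pull0, pull1)
    return V(n, *prior0, *prior1)

for n in (1, 2, 10):
    print(n, bandit_value(n), float(bandit_value(n)) / n)
```

With uniform priors, a single observation is worth 1/2, but two observations are worth 13/12 rather than 1, because the second choice can react to the first result. The per-observation value keeps growing with $n$ as more learning can be exploited.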

12.5.2 Adaptive designs for binary outcomes 

To continue in a similar vein, consider a clinical trial that compares two treatments,
say A and B. Patients arrive sequentially, and each can only receive one of the two
therapies. Therapy is chosen on the basis of the information available up until that
point on the treatments’ efficacy. Let $n$ denote the total number of patients in the
trial. At the end of the trial, the best-looking therapy is assigned to $(N - n)$ additional
patients. The total number $N$ of patients involved is called the patient horizon.
A natural question in this setting is how to optimally allocate patients in the trial to
maximize the number of patients who respond positively to the treatment over the
patient horizon $N$. This problem is solved with dynamic programming.

Let us say that response to treatment is a binary event, and call a positive response
simply “response.” Let $\theta_A$ and $\theta_B$ denote the population proportion of responses under
treatments A and B, respectively. For a prior distribution, assume that $\theta_A$ and $\theta_B$ are
independent and with a uniform distribution on (0, 1). Let $n_A$ denote the number of
patients assigned to treatment A and $r_A$ denote the number of responses among those
$n_A$ patients. Similarly, define $n_B$ and $r_B$ for treatment B.

To illustrate the required calculations, assume $n = 4$ and $N = 100$. When using
dynamic programming in the last stage, we need to consider all possibilities for
which $n_A + n_B = n = 4$. Consider for example $n_A = 3$ and $n_B = 1$. Under this configuration,
$r_A$ takes values in $0, \ldots, 3$ while $r_B$ takes values 0 or 1. Consider
$n_A = 3$, $r_A = 2$, $n_B = 1$, $r_B = 0$. The posterior distribution of $\theta_A$ is Beta(3, 2) and the
predictive probability of response is 3/5, while $\theta_B$ is Beta(1, 2) with predictive probability
1/3. Because 3/5 > 1/3 the remaining $N - n = 100 - 4 = 96$ patients are
assigned to A with an expected number of responses given by $96 \times 3/5 = 57.6$. Similar
calculations can be carried out for all other combinations of treatment allocations
and experimental outcomes.

Next, take one step back and consider the cases in which $n_A + n_B = 3$. Take
$n_A = 2$, $n_B = 1$ for an example. Now, $r_A$ takes values 0, 1, or 2 while $r_B$ takes values
0 or 1. Consider the case $n_A = 2$, $r_A = 1$, $n_B = 1$, $r_B = 0$. The current posterior distributions
are Beta(2, 2) for $\theta_A$ and Beta(1, 2) for $\theta_B$. If we are allowed one additional
observation in the trial before we get to $n = 4$, to calculate the expected number of
future responses we need to consider two possibilities: assigning the last patient to
treatment A or to treatment B.

Choosing A moves the process to the case $n_A = 3$, $n_B = 1$. If the patient responds,
$r_A = 2$, $r_B = 0$ with predictive probability 2/4 = 1/2. Otherwise, with probability
1/2, the process moves to the case $r_A = 1$, $r_B = 0$. The maximal number of responses
is 57.6 (as calculated above) with probability 1/2 and 38.4 (this calculation is omitted
and left as an exercise) with probability 1/2. Thus, the expected number of future
responses with A is $(1 + 57.6) \times 1/2 + 38.4 \times 1/2 = 48.5$. Similarly, when treatment
B is chosen for the next patient $n_A = 2$, $n_B = 2$. If the patient responds to treatment B,
which happens with probability 1/3, the process moves to $r_A = 1$, $r_B = 1$; otherwise
it moves to $r_A = 1$, $r_B = 0$. One can show that the expected number of successes
under treatment B is 48.3. Because 48.5 > 48.3, the optimal decision is to assign the
next patient to treatment A.
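
These calculations can be reproduced with a short dynamic program. The sketch below assumes the independent uniform priors used in the example; the function names and the convention of returning the two helper functions are our own.

```python
from fractions import Fraction
from functools import lru_cache

def adaptive_design(n, N):
    """Expected future responses under optimal allocation, with uniform priors."""
    def mean(r, m):
        # Posterior mean of Beta(r + 1, m - r + 1): the predictive response probability.
        return Fraction(r + 1, m + 2)

    def after(arm, nA, rA, nB, rB):
        """Expected future responses if the next trial patient receives `arm`."""
        if arm == "A":
            p = mean(rA, nA)
            return (p * (1 + value(nA + 1, rA + 1, nB, rB))
                    + (1 - p) * value(nA + 1, rA, nB, rB))
        p = mean(rB, nB)
        return (p * (1 + value(nA, rA, nB + 1, rB + 1))
                + (1 - p) * value(nA, rA, nB + 1, rB))

    @lru_cache(maxsize=None)
    def value(nA, rA, nB, rB):
        if nA + nB == n:
            # Trial over: the remaining N - n patients get the better-looking arm.
            return (N - n) * max(mean(rA, nA), mean(rB, nB))
        return max(after("A", nA, rA, nB, rB), after("B", nA, rA, nB, rB))

    return value, after

value, after = adaptive_design(n=4, N=100)
print(float(value(3, 2, 1, 0)), float(value(3, 1, 1, 0)))  # last-stage values
print(float(after("A", 2, 1, 1, 0)), float(after("B", 2, 1, 1, 0)))
```

The sketch returns 57.6 and 38.4 for the two last-stage cells, and 48.5 for A against 145/3, about 48.3, for B at the cell $n_A = 2$, $r_A = 1$, $n_B = 1$, $r_B = 0$, in agreement with the figures quoted above.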

After calculating the maximal expected number of responses for all cells under
which $n_A + n_B = 3$, we turn to the case with $n_A + n_B = 2$ and so on until $n_A +
n_B = 0$. Figure 12.6 shows the optimal decisions when $n = 4$ and $N = 100$ for
each combination of $n_A$, $r_A$, $n_B$, $r_B$. Each separated block of cells corresponds to a pair
$(n_A, n_B)$. Within each block, each individual cell is a combination of $r_A = 0, \ldots, n_A$
and $r_B = 0, \ldots, n_B$. In the figure, empty square boxes denote cases for which A is
the optimal decision while full square boxes denote cases for which B is optimal.
An asterisk represents cases where both treatments are optimal. For instance, the left
bottom corner of the figure has $n_A = r_A = 0$, $n_B = r_B = 0$, for which both treatments
are optimal. Then, in the top left block of cells when $n_A = r_A = 0$ and $n_B = 4$, the
optimal treatment is A for $r_B = 0, 1$. Otherwise, B is optimal. Figure 12.7 shows the
optimal design under two different prior choices on $\theta_A$. These priors were chosen to
have mean 0.5, but different variances.

Figure 12.6 Optimal decisions when $n = 4$ and $N = 100$ given available data as
given by $n_A$, $r_A$, $n_B$, and $r_B$.

Figure 12.7 Sensitivity of the optimal decisions under two different prior choices
for $\theta_A$ when $n = 4$ and $N = 100$. Designs shown in the left panel assume that
$\theta_A \sim \text{Beta}(0.5, 0.5)$. In the right panel we assume $\theta_A \sim \text{Beta}(2, 2)$.



Figure 12.8 Optimal decisions when $n = 8$ and $N = 100$ given available data as
given by $n_A$, $r_A$, $n_B$, and $r_B$.


In Figure 12.8 we also show the optimal design when $n = 8$ and $N = 100$
assuming independent uniform priors for $\theta_A$ and $\theta_B$, while in Figure 12.9 we give an
example of the sensitivity of the design to prior choices on $\theta_A$. Figures in this section
are similar to those presented in Berry and Stangl (1996, chapter 1).

Figure 12.9 Sensitivity of the optimal decisions under two different prior choices
for $\theta_A$ when $n = 8$, $N = 100$. Designs shown in the left panel assume that $\theta_A \sim
\text{Beta}(0.5, 0.5)$. In the right panel we assume $\theta_A \sim \text{Beta}(2, 2)$.


In the example we presented, the allocation of patients to treatments is adaptive,
in that it depends on the results of previous patients assigned to the same treatments.
This is an example of an adaptive design. Berry and Eick (1995) discuss
the role of adaptive designs in clinical trials and compare some adaptive strategies
to balanced randomization which has a traditional and dominant role in randomized
clinical trials. They conclude that the Bayesian decision-theoretic adaptive
design described in the above example (assuming that $\theta_A$ and $\theta_B$ are independent and
with uniform distributions) is better for any choice of patient horizon $N$. However,
when the patient horizon is very large, this strategy does not perform much better
than balanced randomization. In fact, balanced randomization is a good solution
when the aim of the design is maximizing effective treatment in the whole population.
However, if the condition being treated is rare then learning with the aim
of extrapolating to a future population is much less important, because the future
population may not be large. In these cases, using an adaptive procedure is more
critical.


12.6 Variable selection in multiple regression 

We now move to an application to multiple regression, drawn from Lindley (1968a).
Let $y$ denote a response variable, while $x = (x_1, \ldots, x_p)'$ is a vector of explanatory
variables. The relationship between $y$ and $x$ is governed by

$$E[y \mid \theta, x] = x'\theta$$
$$\text{Var}[y \mid \theta, x] = \sigma^2;$$

that is, we have a linear multiple regression model with constant variance. To
simplify our discussion we assume that $\sigma^2$ is known. We make two additional
assumptions. First, we assume that $\theta$ and $x$ are independent, that is

$$f(x, \theta) = f(x)\pi(\theta).$$

This means that learning the value of the explanatory variables alone does not carry
any information about the regression coefficients. Second, we assume that $y$ has
known density $f(y \mid x, \theta)$, so that we do not learn anything new about the likelihood
function from new observations of either $x$ or $y$.

The decision maker has to predict a future value of the dependent variable $y$.
To help with the prediction he or she can observe some subset (or all) of the $p$
explanatory variables. So, the questions of interest are: (i) which explanatory variables
should the decision maker choose; and (ii) having observed the explanatory
variables of his or her choice, what is the prediction for $y$?

Let $I$ denote a subset of the integers in the set $\{1, \ldots, p\}$ and $J$ denote the complementary
subset. Let $x^I$ denote a vector with components $x_i$, $i \in I$. Our decision
space consists of elements $a = (I, g(\cdot))$ where $g$ is a function from $\Re^s$ to $\Re$ (with $s$ the
number of indexes in $I$); that is, it consists of a subset of explanatory variables and a
prediction function. Let us assume that our utility function is

$$u(a(\theta)) = -(y - g(x^I))^2 - c_I;$$

that is, we have a quadratic utility for the prediction $g(x^I)$ for $y$ with a cost $c_I$ for
observing explanatory variables with indexes in $I$. This allows different variables to
have different costs.

Figure 12.10 shows the decision tree for this problem. At the last stage, we
consider the expected utility, for fixed $x^I$, averaging over $y$, that is we compute

$$E_{y \mid x^I}[-(y - g(x^I))^2] - c_I. \tag{12.26}$$

Next, we select the prediction $g(x^I)$ that maximizes equation (12.26). Under the
quadratic utility function, the optimal prediction is given by

$$g(x^I) = E[y \mid x^I]. \tag{12.27}$$



Figure 12.10 Decision tree for selecting variables with the goal of prediction.
Figure adapted from Lindley (1968a).

The unknowns in this prediction problem include both the parameters and the
unobserved variables. Thus

$$
\begin{aligned}
E[y \mid x^I] &= \int_{\mathcal{X}} \int_{\Theta} E[y \mid x^I, x^J, \theta]\,\pi(\theta) f(x^J \mid x^I)\,d\theta\,dx^J \\
&= \int_{\mathcal{X}} \int_{\Theta} x'\theta\,\pi(\theta) f(x^J \mid x^I)\,d\theta\,dx^J \\
&= E[x \mid x^I]'E[\theta];
\end{aligned}
$$

that is, to obtain the optimal prediction we first estimate the unobserved explanatory
variables $x^J$ and combine those with the observed $x^I$ and the estimated regression
parameters.

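Folding back the decision tree, by averaging over $x^I$ for fixed I, we obtain

$$
\begin{aligned}
E[(y - E[x \mid x^I]'E[\theta])^2] &= \sigma^2 + E[(x'\theta - E[x \mid x^I]'E[\theta])^2] \\
&= \sigma^2 + \mathrm{tr}(\mathrm{Var}[\theta]\mathrm{Var}[x]) + E[x]'\mathrm{Var}[\theta]E[x] \\
&\quad + E[\theta]'\mathrm{Var}[x^J]E[\theta]
\end{aligned}
$$

where $\mathrm{Var}[\theta]$ and $\mathrm{Var}[x]$ are the covariance matrices for $\theta$ and x, respectively, and
$\mathrm{Var}[x^J] = E[(x - E[x \mid x^I])(x - E[x \mid x^I])']$, that is the covariance matrix of the whole
predictor vector, once the subset $x^I$ is fixed.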

Thus, the expected utility for fixed I is

$$-(\sigma^2 + \mathrm{tr}(\mathrm{Var}[\theta]\mathrm{Var}[x]) + E[x]'\mathrm{Var}[\theta]E[x] + E[\theta]'\mathrm{Var}[x^J]E[\theta] + c_I). \tag{12.28}$$


The optimal solution at the first stage of the decision tree is obtained by maximizing
equation (12.28) over I. Because only the last two elements of equation (12.28)
depend on I, this corresponds to choosing I to reach

$$\min_I \{E[\theta]'\mathrm{Var}[x^J]E[\theta] + c_I\}.$$

The solution will depend both on how well the included and excluded x predict y,
and on how well the included $x^I$ predict the excluded $x^J$.

As a special case, suppose that the costs of each observation $x_i$, $i \in I$, are
additive so that $c_I = \sum_{i \in I} c_i$. Moreover, assume that the explanatory variables are
independent. This implies

$$\min_I \{E[\theta]'\mathrm{Var}[x^J]E[\theta] + c_I\} = \min_I \Big\{ \sum_{j \in J} (E[\theta_j])^2\,\mathrm{Var}[x_j] + \sum_{i \in I} c_i \Big\}.$$

It follows from the above equation that $x_i$ should be observed if and only if

$$(E[\theta_i])^2\,\mathrm{Var}[x_i] > c_i;$$

Figure 12.11 Decision tree for selecting variables with the goal of controlling the
response. Figure adapted from Lindley (1968a).


that is, if either its variation is large or the squared expected value of the regression 
parameter is large compared to the cost of observation. 
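
The selection rule for the independent, additive-cost case can be checked numerically. The following is a minimal sketch, not taken from Lindley (1968a): the function names and the numbers for $E[\theta_i]$, $\mathrm{Var}[x_i]$, and $c_i$ are hypothetical, and variables are indexed from 0. It applies the rule and compares it with a brute-force search over all subsets I.

```python
from itertools import combinations

def select_variables(mean_theta, var_x, cost):
    """Observe x_i iff (E[theta_i])^2 * Var[x_i] > c_i (independent x, additive costs)."""
    return [i for i, (m, v, c) in enumerate(zip(mean_theta, var_x, cost))
            if m * m * v > c]

def objective(subset, mean_theta, var_x, cost):
    """Sum over excluded j of (E[theta_j])^2 Var[x_j] plus sum over included i of c_i."""
    included = set(subset)
    total = 0.0
    for i, (m, v, c) in enumerate(zip(mean_theta, var_x, cost)):
        total += c if i in included else m * m * v
    return total

def brute_force(mean_theta, var_x, cost):
    """Minimize the objective by enumerating every subset of {0, ..., p-1}."""
    p = len(cost)
    subsets = (s for r in range(p + 1) for s in combinations(range(p), r))
    return min(subsets, key=lambda s: objective(s, mean_theta, var_x, cost))

# Hypothetical inputs, chosen only for illustration.
mean_theta = [2.0, 0.5, -1.0]
var_x = [1.0, 4.0, 0.25]
cost = [1.0, 2.0, 0.5]

chosen = select_variables(mean_theta, var_x, cost)
best = brute_force(mean_theta, var_x, cost)
print(chosen, list(best), objective(best, mean_theta, var_x, cost))
```

With these inputs only the first variable passes the threshold, and the exhaustive search selects the same subset. The enumeration is exponential in p and is included only as a check on the closed-form rule.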

Let us now change the decision problem and assume that we can set the explanatory
variables as we want, and the goal is to bring the value of the response variable
towards a preassigned value $y_0$. The control of the response variable depends on the
selection of explanatory variables and on the choice of a setting $x_0^I$. Our decision
space consists of elements $a = (I, x_0^I)$; that is, it consists of selecting a subset of
explanatory variables and the values assigned to these variables. Let us assume that
our utility function is

$$u(a(\theta)) = -(y - y_0)^2 - c(x^I).$$

This is no longer a two-stage sequential decision problem, because no new data
are accrued between the choice of the subset and the choice of the setting. As seen
in Figure 12.11 this is a nested decision problem, and it can be solved by solving for
the setting given I, and plugging the solution back into the expected utility to solve
for I. In a sense, this is a special case of dynamic programming where we skip one of
the expectations. Using additional assumptions on the distributions of x and on the
form of the cost function, in a special case Lindley (1968a) shows that if $y_0 = 0$ and
the cost function does not depend on $x^I$, then an explanatory variable $x_i$ is chosen for
controlling if and only if

$$E[\theta_i^2]\,\mathrm{Var}[x_i] > c_i.$$

This result parallels that seen earlier. However, a point of contrast is that the decision
depends on the second moment of the regression parameter,
$E[\theta_i^2] = \mathrm{Var}[\theta_i] + (E[\theta_i])^2$, and hence on its variance as well as its mean, as a result
of the differences in goals for the decision making. For a detailed development see
Lindley (1968a).


12.7 Computing 

Implementation of the fully sequential decision-theoretic approach is challenging in 
applications. In addition to the general difficulties that apply with the elicitation of 
priors and utilities in decision problems, the implementation of dynamic programming
is limited by its computational complexity, which grows exponentially with the
maximum number of stages S. In this section we review a couple of examples that
give a sense of the challenges and possible approaches. 

Suppose one wants to design a clinical trial to estimate the effect $\theta$ of an
experimental treatment over a placebo. Prior information about this parameter is
summarized by $\pi(\theta)$ and the decisions are made sequentially. There is a maximum
of S times, called monitoring times, when we can examine the data accrued thus far,
and take action. Possible actions are $a_1$ (stop the trial and conclude that the treatment
is better), $a_2$ (stop the trial and conclude that the placebo is preferred), or $a_3$ (continue
the trial). At the final monitoring time S, the decision is only between actions $a_1$ and
$a_2$ with utility functions

$$u(a_1(\theta)) = -k_1^{(S)}(\theta - b_1)^+$$
$$u(a_2(\theta)) = -k_2^{(S)}(b_2 - \theta)^+$$

with $y^+ = \max(0, y)$. This utility builds in an indifference zone $(b_2, b_1)$ within which
the effect of the treatment is considered similar to that of placebo and, thus, either is
acceptable. Low $\theta$ are good (say, they imply a low risk) so the experimental treatment
is preferred when $\theta < b_2$, while placebo is preferred when $\theta > b_1$. Also, suppose a
constant cost C for any additional observation.

As S increases, backwards induction becomes increasingly computationally difficult.
Considering the two-sided sequential decision problem described above, Carlin
et al. (1998) propose an approach to reduce the computational complexity associated
with dynamic programming. Their forward sampling algorithm can be used to find
optimal stopping boundaries within a class of decision problems characterized by
decision rules of the form

$$
\begin{array}{ll}
E[\theta \mid x^{(1)}, \ldots, x^{(s)}] \le \gamma_{s,L} & \text{choose } a_1 \\
E[\theta \mid x^{(1)}, \ldots, x^{(s)}] > \gamma_{s,U} & \text{choose } a_2 \\
\gamma_{s,L} < E[\theta \mid x^{(1)}, \ldots, x^{(s)}] \le \gamma_{s,U} & \text{choose } a_3
\end{array}
$$

for $s = 1, \ldots, S$ and with $\gamma_{S,L} = \gamma_{S,U} = \gamma_S$.

To illustrate how their method works, suppose that S = 2 and that a cost C
is associated with carrying out each stage of the trial. To determine a sequential
decision rule of the type considered here, we need to specify a total of $(2S - 1)$
decision boundaries. Let

$$\gamma = (\gamma_{S-2,L}, \gamma_{S-2,U}, \gamma_{S-1,L}, \gamma_{S-1,U}, \gamma_S)'$$

denote the vector of these boundaries. To find an optimal $\gamma$ we first generate a
Monte Carlo sample of size M from $f(\theta, x^{(S-1)}, x^{(S)}) = \pi(\theta) f(x^{(S-1)} \mid \theta) f(x^{(S)} \mid \theta)$. Let
$(\theta_m, x_m^{(S-1)}, x_m^{(S)})$ denote a generic element in the sample. For a fixed $\gamma$, the utilities
achieved with this rule are calculated using the following algorithm:

$$
\begin{array}{lll}
\text{If} & E[\theta] \le \gamma_{S-2,L} & u_m = -k_1^{(S-2)}(\theta_m - b_1)^+ \\
\text{else if} & E[\theta] > \gamma_{S-2,U} & u_m = -k_2^{(S-2)}(b_2 - \theta_m)^+ \\
\text{else if} & E[\theta \mid x_m^{(S-1)}] \le \gamma_{S-1,L} & u_m = -k_1^{(S-1)}(\theta_m - b_1)^+ - C \\
\text{else if} & E[\theta \mid x_m^{(S-1)}] > \gamma_{S-1,U} & u_m = -k_2^{(S-1)}(b_2 - \theta_m)^+ - C \\
\text{else if} & E[\theta \mid x_m^{(S-1)}, x_m^{(S)}] \le \gamma_S & u_m = -k_1^{(S)}(\theta_m - b_1)^+ - 2C \\
\text{else} & & u_m = -k_2^{(S)}(b_2 - \theta_m)^+ - 2C.
\end{array}
$$


A Monte Carlo estimate of the posterior expected utility incurred with $\gamma$ is
given by

$$\frac{1}{M}\sum_{m=1}^{M} u_m.$$


This algorithm provides a way of evaluating the expected utility of any strategy, and
can be embedded into a $(2S - 1)$-dimensional optimization to numerically search for
the optimal decision boundary $\gamma^*$. With this procedure Carlin et al. (1998) replace the
initial decision problem of deciding among actions $a_1, a_2, a_3$ by a problem in which
one needs to find optimal thresholds that define the stopping times in the sequential
procedure. The strategy depends on the history of actions and observations only
through the current posterior mean. The main advantage of the forward sampling
algorithm is that its computational cost grows only linearly in S, while that of the backward
induction algorithm would grow exponentially with S. A disadvantage is that it requires a
potentially difficult maximization over a continuous space of dimension $(2S - 1)$, while dynamic
programming is built upon simple one-dimensional discrete maximizations over the
set of actions.
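
The evaluation step of forward sampling can be sketched in a few lines. The sketch below is an illustration under assumptions that are not in the text: a normal prior $\theta \sim N(\mu_0, \tau^2)$, one normal observation per stage with known variance, and constants $k_1$, $k_2$ that do not change with the stage. All function and parameter names are hypothetical. A single Monte Carlo sample is drawn once and then reused to evaluate any boundary vector $\gamma$.

```python
import random

def pos(y):
    """Positive part y^+ = max(0, y)."""
    return max(0.0, y)

def posterior_mean(mu0, tau2, sigma2, xs):
    """Posterior mean of theta for theta ~ N(mu0, tau2), x | theta ~ N(theta, sigma2)."""
    precision = 1.0 / tau2 + len(xs) / sigma2
    return (mu0 / tau2 + sum(xs) / sigma2) / precision

def utility_of_path(theta, x1, x2, gamma, mu0, tau2, sigma2, b1, b2, k1, k2, C):
    """Utility u_m of one simulated trajectory under the threshold rule gamma."""
    g0L, g0U, g1L, g1U, g2 = gamma
    if mu0 <= g0L:                       # stop before any data: choose a_1
        return -k1 * pos(theta - b1)
    if mu0 > g0U:                        # stop before any data: choose a_2
        return -k2 * pos(b2 - theta)
    m1 = posterior_mean(mu0, tau2, sigma2, [x1])
    if m1 <= g1L:
        return -k1 * pos(theta - b1) - C
    if m1 > g1U:
        return -k2 * pos(b2 - theta) - C
    m2 = posterior_mean(mu0, tau2, sigma2, [x1, x2])
    if m2 <= g2:
        return -k1 * pos(theta - b1) - 2 * C
    return -k2 * pos(b2 - theta) - 2 * C

def draw_sample(M, mu0, tau2, sigma2, seed=0):
    """Draw (theta_m, x1_m, x2_m) once, out to the maximum number of stages."""
    rng = random.Random(seed)
    sample = []
    for _ in range(M):
        theta = rng.gauss(mu0, tau2 ** 0.5)
        sample.append((theta, rng.gauss(theta, sigma2 ** 0.5),
                       rng.gauss(theta, sigma2 ** 0.5)))
    return sample

def expected_utility(gamma, sample, mu0, tau2, sigma2, b1, b2, k1, k2, C):
    """Monte Carlo average of u_m over the stored sample."""
    total = sum(utility_of_path(t, x1, x2, gamma, mu0, tau2, sigma2, b1, b2, k1, k2, C)
                for t, x1, x2 in sample)
    return total / len(sample)

# Hypothetical settings, for illustration only.
params = dict(mu0=0.0, tau2=1.0, sigma2=1.0, b1=0.2, b2=-0.2, k1=1.0, k2=1.0, C=0.05)
sample = draw_sample(2000, params["mu0"], params["tau2"], params["sigma2"], seed=1)
inf = float("inf")
always_continue = (-inf, inf, -inf, inf, 0.0)   # never stop early
stop_at_once = (inf, inf, inf, inf, 0.0)        # choose a_1 immediately
eu_continue = expected_utility(always_continue, sample, **params)
eu_stop = expected_utility(stop_at_once, sample, **params)
print(eu_continue, eu_stop)
```

Because the sample is fixed, `expected_utility` is a deterministic function of $\gamma$ and could be handed to a numerical optimizer over the five boundaries; that search is not shown here.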

Kadane and Vlachos (2002) consider a hybrid algorithm which combines
dynamic programming with forward sampling. Their hybrid algorithm works backwards
for $S_0$ stages and provides values of the expected utility for the optimal continuation.
Then, the algorithm is completed by forward sampling from the start of the
trial up to the stage $S_0$ when backward induction starts. While reducing the optimization
problem with forward sampling, the hybrid algorithm allows for a larger number
of stages in sequential problems previously intractable with dynamic programming.
The trade-off between accuracy in the search for optimal strategies and computing
time is controlled by the maximum number of backward induction steps $S_0$.

Müller et al. (2007a) propose a combination of forward simulation, to approximate
the integral expressions, and a reduction of the allowable action space and
sample space, to avoid problems related to an increasing number of possible trajectories
in the backward induction. They also assume that at each stage, the choice of an
action depends on the data portion of the history set only through a function $T_s$
of $H_{s-1}$. This function could potentially be any summary statistic, such as the posterior
mean as in Carlin, Kadane & Gelfand (1998). At each stage s, both $T_{s-1}$ and $a^{(s-1)}$
are discretized over a finite grid. Evaluation of the expected utility uses forward sampling.
M samples $(\theta_m, x_m)$ are generated from the distribution $f(x^{(1)}, \ldots, x^{(S-1)} \mid \theta)\pi(\theta)$.
This is done once and for all, stretching out the data generation to its maximum number
of stages, irrespective of whether it is optimal to stop earlier or not. The value
of the summary statistic $T_{s,m}$ is then recorded at stage 1 through $S - 1$ for every
generated data sequence. The coarsening of data and decisions to a grid simplifies
the computation of the expected utilities, which are now simple sums over the set
of indexes m that lead to trajectories that belong to a given cell in the history. As
the grid size is kept constant, the method avoids the increasing number of pathways
with the traditional backward induction. This constrained backward induction is an
approximate method and its successful implementation depends on the choice of the
summary statistic $T_s$, the number of points on the grid for $T_s$, and the number M
of simulations. An application of this procedure to a sequential dose-finding trial is
described in Berry et al. (2001).

Brockwell and Kadane (2003) also apply discretization of well-chosen statistics 
to simplify the history $H_s$. Their examples focus on statistics of dimension up to three
and a maximum of 50 stages. They note, however, that sequential problems involving
statistics with higher dimension may still be intractable with the gridding method.
More broadly, computational sequential analysis is still an open area, where novel 
approaches and progress could greatly contribute to a more widespread application 
of the beautiful underlying ideas. 


12.8 Exercises 

Problem 12.1 (French 1988) Four well-shuffled cards, the kings of clubs, diamonds,
hearts, and spades, are laid face down on the table. A man is offered the
following choice. Either: a randomly selected card will be turned over. If it is red, 
he will win £100; if it is black, he will lose £100. Or: a randomly selected card will 
be turned over. He may then choose to pay £35 and call the bet off or he may continue.
If he continues, one of the remaining three cards will be randomly selected
and turned over. If it is red, he will win £100; if it is black, he will lose £100. Draw 
a decision tree for the problem. If the decision maker is risk neutral, what should 
he do? Suppose instead that his utility function for sums of money in this range is 
u(x) = log(1 + x/200). What should he do in this case?

Problem 12.2 (French 1988) A builder is offered two plots of land for house building
at £20 000 each. If the land does not suffer from subsidence, then he would expect
to make £10 000 net profit on each plot, when a house is built on it. However, if there 
is subsidence, the land is only worth £2000 so he would make an £18 000 loss. He 
believes that the chance that both plots will suffer from subsidence is 0.2, the chance 
that one will suffer from subsidence is 0.3, and the chance that neither will is 0.5. 
He has to decide whether to buy the two plots or not. Alternatively, he may buy one, 
test it for subsidence, and then decide whether to buy the other plot. Assuming that 
the test is a perfect predictor of subsidence and that it costs £200 for the test, what
should he do? Assume that his preferences are determined purely by money and that
he is risk neutral. 

Problem 12.3 (French 1988) You have decided to buy a car, and have eliminated 
possibilities until you are left with a straight choice between a brand new car costing 
£4000 and a second-hand car costing £2750. You must have a car regularly for work. 
So, if the car that you buy proves unreliable, you will have to hire a car while it 
is being repaired. The guarantee on the new car covers both repair costs and hire 
charges for a period of two years. There is no guarantee on the second-hand car. You 
have considered the second-hand-car market and noticed that cars tend to be either 
very good buys or very bad buys, few cars falling between the two extremes. With 
this in mind you consider only two possibilities. 

(i) The second-hand car is a very good buy and will involve you in £750 
expenditure on car repairs and car hire over the next two years. 

(ii) The second-hand car is a very bad buy and will cost you £1750 over the next 
two years. 

You also believe that the probability of its being a good buy is only 0.25. However,
you may ask the AA for advice at negligible financial cost, but risking a
probability of 0.3 that the second-hand car will be sold while they are arranging 
a road test. The road test will give a satisfactory or unsatisfactory report with the 
following probabilities: 


Probability of            Conditional on the car being:
AA report being:          a very bad buy      a very good buy

Satisfactory              0.1                 0.9
Unsatisfactory            0.9                 0.1

If the second-hand car is sold before you buy it, you must buy the new car. Alter¬ 
natively you may ask your own garage to examine the second-hand car. They can do 
so immediately, thus not risking the loss of the option to buy, and will also do so at 
negligible cost, but you do have to trust them not to take a “back-hander” from the 
second-hand-car salesman. As a result, you evaluate their reliability as: 


Probability of            Conditional on the car being:
garage report being:      a very bad buy      a very good buy

Satisfactory              0.5                 1
Unsatisfactory            0.5                 0

You can arrange at most one of the tests. What should you do, assuming that
you are risk neutral for monetary gains and losses and that no other attribute is of
importance to you?

Problem 12.4 Write the algorithm of Section 12.3.2 for the case in which z only
depends on the unknown $\theta$ and the terminal action, and in which observation of each
of the random variables in stages $1, \ldots, S - 1$ has cost C.

Problem 12.5 Referring to Section 12.4.1: 

1. Prove equation (12.15). 

2. Use backwards induction to prove equation (12.16). 


Problem 12.6 Consider the example discussed in Section 12.4.2. Suppose that
$\theta^{(1)}, \ldots, \theta^{(S)}$ are independent and identically distributed with a uniform distribution
on [0,1]. Prove that

$$h_S = \frac{S}{S+1}$$

and

$$w_S = \frac{1}{2}(1 + w_{S-1}^2) \ge \frac{S+1}{S+3}.$$


Changes in utility
as information 


In previous chapters we discussed situations in which a decision maker, before making
a decision, has the opportunity to observe data x. We now turn to the questions
of whether this observation should be made and how worthwhile it is likely to be. 
Observing x could give information about the state of nature and, in this way, lead 
to a better decision; that is, a decision with higher expected utility. In this chapter 
we develop this idea more formally and present a general approach for assessing 
the value of information. Specifically, the value of information is quantified as the 
expected change in utility from observing x, compared to the “status quo” of not 
observing any additional data. This approach permits us to measure the information 
provided by an experiment on a metric that is tied to the decision problem at hand. 
Our discussion will follow Raiffa and Schlaifer (1961) and DeGroot (1984). 

In many areas of science, data are collected to accumulate knowledge that will 
eventually contribute to many decisions. In that context the connection outlined 
above between information and a specific decision is not always useful. Motivated 
by this, we will also explore an idea of Lindley (1956) for measuring the information 
in a data set, which tries to capture, in a decision-theoretic way, “generic learning” 
rather than specific usefulness in a given problem. 

Featured articles: 

Lindley, D. V. (1956). On a measure of the information provided by an experiment, 
Annals of Mathematical Statistics 27: 986-1005. 

DeGroot, M. H. (1984). Changes in utility as information. Theory and Decision 
17: 287-303. 

Useful references are Raiffa and Schlaifer (1961) and Raiffa (1970). 


Decision Theory: Principles and Approaches G. Parmigiani, L. Y. T. Inoue 
© 2009 John Wiley & Sons, Ltd

13.1 Measuring the value of information

13.1.1 The value function 

The first step in quantifying the change in utility resulting from a change in knowledge
is to describe the value of knowledge in absolute terms. This can be done in
the context of a specific decision problem, for a given utility specification, as follows.
As in previous chapters, let $a^*$ denote the Bayes action, that is $a^*$ maximizes
the expected utility

$$U_\pi(a) = \int_\Theta u(a(\theta))\pi(\theta)\,d\theta. \tag{13.1}$$

As usual, a is an action, u is the utility function, $\theta$ is a parameter with possible values
in a parameter space $\Theta$, and $\pi$ is the decision maker’s prior distribution. Expectation
(13.1) is taken with respect to $\pi$. We need to keep track of this fact in our discussion,
and we do so using the subscript $\pi$ on U. The amount of utility we expect to achieve
if we decide to make an immediate choice without experimentation, assuming we
choose the best action under prior $\pi$, is

$$V(\pi) = \sup_a U_\pi(a). \tag{13.2}$$

This represents, in absolute terms, the “value” to the decision maker of solving the
problem as well as possible, given the initial state of knowledge. We illustrate $V(\pi)$
using three stylized statistical decision problems:

1. Consider an estimation problem where a represents a point estimate of $\theta$ and
the utility is

$$u(a(\theta)) = -(\theta - a)^2,$$

that is the negative of the familiar squared error loss. The optimal decision is
$a^* = E[\theta]$ as we discussed in Chapter 7. Computing the expectation in (13.2)
gives

$$V(\pi) = -\int_\Theta (\theta - a^*)^2\pi(\theta)\,d\theta = -\int_\Theta (\theta - E[\theta])^2\pi(\theta)\,d\theta = -\mathrm{Var}[\theta].$$

The value of solving the problem as well as possible given the knowledge
represented by $\pi$ is the negative of the variance of $\theta$.

2. Next, consider the discrete parameter space $\Theta = \{1, 2, \ldots\}$ and imagine that
the estimation problem is such that we gain something only if our estimate is
exactly right. The corresponding utility is

$$u(a(\theta)) = I_{\{a=\theta\}},$$

where, again, a represents a point estimate of $\theta$. The expected utility is maximized
by $a^* = \mathrm{mode}(\theta) \equiv \theta^0$ (use your favorite tie-breaking rule if there is
more than one mode) and

$$V(\pi) = \pi(\theta^0).$$

The value V is now the largest mass of the prior distribution.

3. Last, consider forecasting $\theta$ when forecasts are expressed as a probability
distribution, as is done, for example, for weather reports. Now a is the whole
probability distribution on $\theta$. This is the problem we discussed in detail in
Chapter 10 when we talked about scoring rules. Take the utility to be

$$u(a(\theta)) = \log(a(\theta)).$$

This utility is a proper local scoring rule, which implies that the optimal
decision is $a^*(\theta) = \pi(\theta)$. Then,

$$V(\pi) = \int_\Theta \log(\pi(\theta))\pi(\theta)\,d\theta.$$

The negative of this quantity is also known as the entropy (Cover and Thomas
1991) of the probability distribution $\pi(\theta)$. A high value of V means low
entropy, which corresponds to low variability of $\theta$.
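
For a discrete prior the three value functions above reduce to short sums. The sketch below is only an illustration: the prior, stored as a dictionary from parameter values to probabilities, and the function names are hypothetical.

```python
import math

def value_squared_error(prior):
    """V(pi) = -Var[theta] for the utility -(theta - a)^2."""
    mean = sum(t * p for t, p in prior.items())
    return -sum(p * (t - mean) ** 2 for t, p in prior.items())

def value_zero_one(prior):
    """V(pi) = pi(theta^0), the largest prior mass, for the utility I{a = theta}."""
    return max(prior.values())

def value_log_score(prior):
    """V(pi) = sum of pi log pi, the negative entropy, for the utility log a(theta)."""
    return sum(p * math.log(p) for p in prior.values() if p > 0)

# Hypothetical prior on Theta = {1, 2, 3}.
prior = {1: 0.7, 2: 0.2, 3: 0.1}
print(value_squared_error(prior))  # about -0.44
print(value_zero_one(prior))       # 0.7
print(value_log_score(prior))      # about -0.80
```

A prior that puts all its mass on one point gives the largest possible value in each case: 0, 1, and 0, respectively.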

In all these examples the value V of the decision problem is directly related to a
measure of the strength of the decision maker’s knowledge about $\theta$, as reflected by
the prior $\pi$. The specific aspects of the prior that determine V depend on the characteristics
of the problem. All three of the above problems are statistical decisions.
In Section 13.2.2 we discuss in detail an example using a two-stage decision tree
similar to those of Chapter 12, in which the same principles are applied to a practical
decision problem.

The next question for us to consider is how V changes as the knowledge embodied
in $\pi$ changes. To think of this concretely, consider example 2 above, and suppose
priors $\pi_1$ and $\pi_2$ are as follows:

   θ      π_1     π_2

  −1      0.5     0.1
   0      0.4     0.4
   1      0.1     0.5

For a decision maker with prior $\pi_1$, the optimal choice is −1 and $V(\pi_1) = 0.5$.
For one with prior $\pi_2$, the optimal choice is 1 and $V(\pi_2) = 0.5$. For a third with prior
$\alpha\pi_1 + (1 - \alpha)\pi_2$ with, say, $\alpha = 0.5$, the optimal choice is 0, with $V(\alpha\pi_1 + (1 - \alpha)\pi_2) =
0.4$, a smaller value. The third decision maker is less certain about $\theta$ than either of the
previous two. In fact, the prior $\alpha\pi_1 + (1 - \alpha)\pi_2$ is hedging bets between $\pi_1$ and $\pi_2$
by taking a weighted average. It therefore embodies more variability than either one
of the other two. This is why the third decision maker expects to gain less than the
others from having to make an immediate decision, and, as we will see later, may be
more inclined to experiment.

You can check that no matter how you choose $\alpha$ in (0,1), you cannot get V to be
above 0.5, so the third decision maker expects to gain less no matter what $\alpha$ is. This
is an instance of a very general inequality which we consider next. 
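
The check suggested above takes only a few lines. The sketch below uses the two priors of the example, stored as lists of masses on $\theta = -1, 0, 1$; the function names and the grid of 99 values of $\alpha$ are choices made for illustration.

```python
def value_zero_one(prior):
    """V(pi) for the utility I{a = theta}: the largest prior mass."""
    return max(prior)

def mixture(alpha, p, q):
    """The prior alpha * p + (1 - alpha) * q, component by component."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p, q)]

pi_1 = [0.5, 0.4, 0.1]   # masses on theta = -1, 0, 1
pi_2 = [0.1, 0.4, 0.5]

alphas = [k / 100 for k in range(1, 100)]
values = [value_zero_one(mixture(al, pi_1, pi_2)) for al in alphas]

print(value_zero_one(mixture(0.5, pi_1, pi_2)))  # about 0.4
print(max(values) <= 0.5)                         # True
```

On this grid the value of the mixture never exceeds $\alpha V(\pi_1) + (1 - \alpha)V(\pi_2) = 0.5$, and it equals 0.4 whenever $\alpha$ lies between 0.25 and 0.75.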

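Theorem 13.1 The function $V(\pi)$ is convex; that is, for any two distributions
$\pi_1$ and $\pi_2$ on $\theta$ and $0 < \alpha < 1$

$$V(\alpha\pi_1 + (1 - \alpha)\pi_2) \le \alpha V(\pi_1) + (1 - \alpha)V(\pi_2). \tag{13.3}$$

Proof: The main tool here is a well-known inequality from calculus, which says that
$\sup\{f_1(x) + f_2(x)\} \le \sup f_1(x) + \sup f_2(x)$. Applying this to the left hand side of (13.3)
we get

$$
\begin{aligned}
V(\alpha\pi_1 + (1 - \alpha)\pi_2) &= \sup_{a \in A} \int_\Theta u(a(\theta))[\alpha\pi_1(\theta) + (1 - \alpha)\pi_2(\theta)]\,d\theta \\
&= \sup_{a \in A} \Big[\alpha \int_\Theta u(a(\theta))\pi_1(\theta)\,d\theta + (1 - \alpha) \int_\Theta u(a(\theta))\pi_2(\theta)\,d\theta\Big] \\
&\le \alpha \sup_{a \in A} \int_\Theta u(a(\theta))\pi_1(\theta)\,d\theta + (1 - \alpha) \sup_{a \in A} \int_\Theta u(a(\theta))\pi_2(\theta)\,d\theta \\
&= \alpha V(\pi_1) + (1 - \alpha)V(\pi_2).
\end{aligned}
$$

□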

We will use this theorem later when measuring the expected change in V that 
results from observing additional data. 

13.1.2 Information from a perfect experiment 

So far we quantified the value of solving the decision problem at hand based on
current knowledge, as captured by $\pi$. At the opposite extreme we can consider the
value of solving the decision problem after having observed an ideal experiment that
reveals the value of $\theta$ exactly. We denote experiments in general by $\mathcal{E}$, and this ideal
experiment by $\mathcal{E}^\infty$. We define $a_\theta$ as an action that would maximize the decision
maker’s utility if $\theta$ was known. Formally, for a given $\theta$,

$$a_\theta = \arg\sup_{a \in A} u(a(\theta)). \tag{13.4}$$

Because $\theta$ is unknown, so is $a_\theta$. For any given $\theta$, the difference

$$V_\theta(\mathcal{E}^\infty) = u(a_\theta(\theta)) - u(a^*(\theta)) \tag{13.5}$$

measures the gap between the best that can be achieved under the current state of
knowledge and the best that can be achieved if $\theta$ is known exactly. This is called the
conditional value of perfect information, since it depends on the specific $\theta$ that is chosen.
What would the value be to the decision maker of knowing $\theta$ exactly? Because different
$\theta$ will change the utility by different amounts, it makes sense to answer this
question by computing the expectation of (13.5) with respect to the prior $\pi$, that is

$$V(\mathcal{E}^\infty) = E_\theta[V_\theta(\mathcal{E}^\infty)], \tag{13.6}$$

which is called the expected value of perfect information. Using equations (13.2), (13.4),
and (13.5) we can, equivalently, rewrite

$$V(\mathcal{E}^\infty) = E_\theta\Big[\sup_a u(a(\theta))\Big] - V(\pi). \tag{13.7}$$

The first term on the right hand side of equation (13.7) is the prior expectation of
the utility of the optimal action given perfect information on $\theta$. Section 13.2.2 works
through an example in detail.
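
For a finite problem, equation (13.7) can be evaluated directly. The following sketch assumes two states and three actions; the utility table, the prior, and the function names are hypothetical and serve only to show the two terms of (13.7).

```python
def expected_utility(utility_row, prior):
    """U_pi(a) for one action, with utility_row[t] = u(a(theta_t))."""
    return sum(u * p for u, p in zip(utility_row, prior))

def value(utilities, prior):
    """V(pi): the largest prior expected utility over the actions."""
    return max(expected_utility(row, prior) for row in utilities.values())

def evpi(utilities, prior):
    """Expected value of perfect information, following (13.7)."""
    best_per_state = [max(row[t] for row in utilities.values())
                      for t in range(len(prior))]
    return expected_utility(best_per_state, prior) - value(utilities, prior)

# Hypothetical utilities u(a(theta)) for theta in {theta_0, theta_1}.
utilities = {"a0": [10.0, 0.0], "a1": [0.0, 5.0], "a2": [4.0, 4.0]}
prior = [0.3, 0.7]

print(value(utilities, prior))  # about 4.0, reached by the hedging action a2
print(evpi(utilities, prior))   # about 2.5
```

Here the Bayes action without information is the hedging action a2. Knowing $\theta$ would allow switching to a0 or a1, for a gain of 6 or 1 in the two states, whose prior expectation is the 2.5 reported by `evpi`.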


13.1.3 Information from a statistical experiment 

From a statistical viewpoint, interesting questions arise when one can perform an
experiment $\mathcal{E}$ which may be short of ideal, but still potentially useful. Say $\mathcal{E}$ consists
of observing a random variable x, with possible values in the sample space $\mathcal{X}$,
and whose probability density function (or probability mass) is $f(x \mid \theta)$. The tree in
Figure 13.1 represents two decisions: whether to experiment and what action to take.
We can solve this two-stage decision problem using dynamic programming, as we
described in Chapter 12. The solution will tell us what action to take, whether we
should perform the experiment before reaching that decision, and, if so, how the
results should affect the subsequent choice of an action.

Suppose the experiment has cost c. The largest value of c such that the experiment
should be performed can be thought of as the value of the experiment for this
decision problem. From this viewpoint, an experiment has value only if it may affect
subsequent decisions. How large this value is depends on the utilities assigned to the
final outcome of the decision process, on the strength of the dependence between x
and $\theta$, and on the probability of observing the various outcomes of x. We now make
this more formal.

The “No Experiment” branch of the tree was discussed in Section 13.1.1. In summary,
if we do not experiment, we choose an action a that maximizes the expected
utility (13.1). The amount of utility we expect to gain is given by (13.2). When we
have the option to perform an experiment, two questions can be considered:

1. After observing x, how much did we learn about $\theta$?

2. How much do we expect to learn from x prior to observing it?

Figure 13.1 Decision tree for a generic statistical decision problem with discrete x.
The decision node on the left represents the choice of whether or not to experiment,
while the ones on the right represent the terminal choice of an action.


Question 2 is the relevant one for solving the decision tree. Question 1 is interesting
retrospectively, as a quantification of observed information: answering question 1
would provide a measure of the observed information about $\theta$ provided by observing
x. To simplify the notation, let $\pi_x$ denote the posterior probability density function (or
posterior probability mass in the discrete case) of $\theta$, that is $\pi_x = \pi(\theta \mid x)$. A possible
answer to question 1 is to consider the observed change in expected utility, that is

$$V(\pi_x) - V(\pi). \tag{13.8}$$

However, it is possible that the posterior distribution of $\theta$ will leave the decision
maker with more uncertainty about $\theta$ or, more generally, with a less favorable scenario.
Thus, with definition (13.8), the observed information can be both positive
and negative. Also, it can be negative in situations where observing x was of great
practical value, because it revealed that we knew far less about $\theta$ than we thought
we did.

Alternatively, DeGroot (1984) proposed to define the observed information as
the expected difference, calculated with respect to the posterior distribution of $\theta$,
between the expected utility of the Bayes decision and the expected utility of the
decision $a^*$ that would be chosen if observation x had not been available. Formally

$$V_x(\mathcal{E}) = V(\pi_x) - U_{\pi_x}(a^*), \tag{13.9}$$

where $a^*$ represents a Bayes decision with respect to the prior distribution $\pi$. The
observed information is always nonnegative, as $V(\pi_x) = \max_a U_{\pi_x}(a)$. Expression
(13.9) is also known as the conditional value of sample information (Raiffa and
Schlaifer 1961) or conditional value of imperfect information (Clemen and Reilly
2001).

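Let us explore further the definition of observed information in two of the stylized
cases we considered earlier.

1. We have seen that in an estimation problem with utility function $u(a(\theta)) =
-(\theta - a)^2$, we have $V(\pi) = -\mathrm{Var}[\theta]$. The observed information (using
equation (13.9)) is

$$
\begin{aligned}
V_x(\mathcal{E}) &= V(\pi_x) - U_{\pi_x}(a^*) \\
&= -\mathrm{Var}[\theta \mid x] + \int_\Theta (\theta - E[\theta])^2\pi_x(\theta)\,d\theta \\
&= -\mathrm{Var}[\theta \mid x] + (\mathrm{Var}[\theta \mid x] + (E[\theta \mid x] - E[\theta])^2) \\
&= (E[\theta \mid x] - E[\theta])^2,
\end{aligned}
$$

that is the square of the change in mean from the prior to the posterior.

2. Take now $u(a(\theta)) = I_{\{a=\theta\}}$, and let $\theta^0$ and $\theta_x^0$ be the modes for $\pi$ and $\pi_x$,
respectively. As we saw, $V(\pi) = \pi(\theta^0)$. Then

$$V_x(\mathcal{E}) = V(\pi_x) - U_{\pi_x}(a^*) = \pi_x(\theta_x^0) - \pi_x(\theta^0),$$

the difference in the posterior probabilities of the posterior and prior modes.
This case is illustrated in Figure 13.2.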
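
The contrast between the change (13.8) and the observed information (13.9) in the second case can be made concrete. The numbers below are hypothetical and are not read off Figure 13.2; they describe a prior concentrated on the first value of $\theta$ and a more dispersed posterior.

```python
def observed_information(prior, posterior):
    """V_x(E) = pi_x(posterior mode) - pi_x(prior mode) for the utility I{a = theta}."""
    prior_mode = max(range(len(prior)), key=lambda t: prior[t])
    return max(posterior) - posterior[prior_mode]

# Hypothetical masses on Theta = {1, 2, 3}.
prior = [0.8, 0.1, 0.1]
posterior = [0.3, 0.4, 0.3]

change_in_value = max(posterior) - max(prior)       # V(pi_x) - V(pi), as in (13.8)
info = observed_information(prior, posterior)       # V_x(E), as in (13.9)
print(change_in_value)  # about -0.4
print(info)             # about 0.1
```

The value of the decision problem drops after the observation, yet the observed information is positive, because the action chosen under the prior is no longer the best one under the posterior.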

We can now move to our second question: how much do we expect to learn from
x prior to observing it? The approach we will follow is to compute the expectation of
the observed information, with respect to the marginal distribution of x. Formally

$$V(\mathcal{E}) = E_x[V_x(\mathcal{E})] = E_x[V(\pi_x) - U_{\pi_x}(a^*)]. \tag{13.10}$$

[Figure 13.2: prior (o) and posterior (x) probability masses plotted against $\theta$, with $V(\pi)$ and $V(\pi_x)$ marked.]

Figure 13.2 Observed information for utility $u(a(\theta)) = I_{\{a=\theta\}}$, when $\Theta = \{1, 2, 3\}$.
Observing x shifts the probability mass and also increases the dispersion, decreasing
the point mass at the mode. In this case the value V of the decision problem is higher
before the experiment, when available evidence indicated $\theta = 1$ as a very likely
outcome, than it is afterwards. However, the observed information is positive because
the experiment suggested that the decision $a = \theta = 1$, which we would have chosen
in its absence, is not likely to be a good one.


An important simplification of this expression can be achieved by expanding the
expectation in the second term of the right hand side:

$$
\begin{aligned}
E_x[U_{\pi_x}(a^*)] &= \int_{\mathcal{X}} \int_\Theta u(a^*(\theta))\pi_x(\theta)m(x)\,d\theta\,dx \\
&= \int_\Theta u(a^*(\theta)) \Big[\int_{\mathcal{X}} f(x \mid \theta)\,dx\Big] \pi(\theta)\,d\theta \\
&= U_\pi(a^*) \\
&= V(\pi),
\end{aligned} \tag{13.11}
$$

where m(x) is the marginal distribution of x. Replacing the above into the definition
of $V(\mathcal{E})$, we get

$$V(\mathcal{E}) = E_x[V(\pi_x)] - V(\pi). \tag{13.12}$$

This expression is also known as the expected value of sample information (Raiffa
and Schlaifer 1961). Note that this is the same as the expectation of (13.8).

Now, because $E_x[\pi_x] = \pi$, we can write

$$V(\mathcal{E}) = E_x[V(\pi_x)] - V(E_x[\pi_x]). \tag{13.13}$$

From the convexity of V (Theorem 13.1) and Jensen’s inequality, we can then derive
the following:

Theorem 13.2 $V(\mathcal{E}) \ge 0$.

This inequality means that no matter what decision problem one faces, expected
utility cannot be decreased by taking into account free information. This very general
connection between rationality and knowledge was first brought out by Raiffa
and Schlaifer (1961), and Good (1967), whose short note is very enjoyable to read.
Good also points out that in the discrete case considered in his note, the inequality
is strict unless there is a dominating action: that is, an action that would be chosen
no matter how much experimentation is done. Later, analysis of Ramsey’s personal
notes revealed that he had already noted Theorem 13.2 in the 1920s (Ramsey 1991).

We now turn to the situation where the decision maker can observe two random
variables, $x_1$ and $x_2$, potentially in sequence. We define $\mathcal{E}_1$ and $\mathcal{E}_2$ as the experiments
corresponding to observation of $x_1$ and $x_2$ respectively, and define $\mathcal{E}_{12}$ as the experiment
of observing both. Let $\pi_{x_1x_2}$ be the posterior after both random variables are
observed. Let $a^*$ be an optimal action when the prior is $\pi$ and $a_1^*$ be an optimal
action when the prior is $\pi_{x_1}$. The observed information if both random variables are
observed at the same time is, similarly to expression (13.9),

$$V_{x_1x_2}(\mathcal{E}_{12}) = V(\pi_{x_1x_2}) - U_{\pi_{x_1x_2}}(a^*). \tag{13.14}$$

If instead we first observe $x_1$ and revise our prior to $\pi_{x_1}$, the additional observed
information from then observing $x_2$ can be defined as

$$V_{x_1x_2}(\mathcal{E}_2|\mathcal{E}_1) = V(\pi_{x_1x_2}) - U_{\pi_{x_1x_2}}(a_1^*). \tag{13.15}$$

In both cases the final results depend on both $x_1$ and $x_2$, but in the second case, the
starting point includes knowledge of $x_1$.

Taking the expectation of (13.14) we get that the expected information from $\mathcal{E}_{12}$ is

$$V(\mathcal{E}_{12}) = E_{x_1x_2}[V(\pi_{x_1x_2})] - V(\pi). \tag{13.16}$$

Next, taking the expectation of (13.15) we get that the expected conditional
information is

$$V(\mathcal{E}_2|\mathcal{E}_1) = E_{x_1x_2}[V(\pi_{x_1x_2}) - U_{\pi_{x_1x_2}}(a_1^*)] = E_{x_1x_2}[V(\pi_{x_1x_2})] - E_{x_1}[V(\pi_{x_1})]. \tag{13.17}$$
This measures the expected value of observing $x_2$ conditional on having already
observed $x_1$. The following theorem gives an additive decomposition of the expected
information of the two experiments.

Theorem 13.3

$$V(\mathcal{E}_{12}) = V(\mathcal{E}_1) + V(\mathcal{E}_2|\mathcal{E}_1). \tag{13.18}$$

Proof:

$$\begin{aligned}
V(\mathcal{E}_{12}) &= E_{x_1x_2}[V(\pi_{x_1x_2})] - V(\pi) \\
&= E_{x_1x_2}[V(\pi_{x_1x_2})] - E_{x_1}[V(\pi_{x_1})] + E_{x_1}[V(\pi_{x_1})] - V(\pi) \\
&= V(\mathcal{E}_2|\mathcal{E}_1) + V(\mathcal{E}_1).
\end{aligned}$$

□

It is important that additivity is in $V(\mathcal{E}_2|\mathcal{E}_1)$ and $V(\mathcal{E}_1)$, rather than $V(\mathcal{E}_2)$ and
$V(\mathcal{E}_1)$. This reflects the fact that we accumulate knowledge incrementally and that the
value of new information depends on what we already know about an unknown. Also,
in general, $V(\mathcal{E}_2|\mathcal{E}_1)$ and $V(\mathcal{E}_1)$ will not be the same even if $x_1$ and $x_2$ are exchangeable.
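
The additive decomposition can be checked by brute force on a small discrete problem. The sketch below is only an illustration: the two-state prior, the Bernoulli likelihood, and the 0-1 utility are hypothetical choices, not taken from the text. It computes the conditional value from the first line of (13.17), using the action that was optimal after the first observation, and compares the sum with the value of the joint experiment.

```python
from itertools import product

# Hypothetical problem: theta in {0, 1}, two conditionally i.i.d. binary observations.
prior = {0: 0.5, 1: 0.5}
p_one = {0: 0.3, 1: 0.8}          # P(x = 1 | theta), assumed for illustration
actions = (0, 1)


def utility(a, theta):
    """0-1 utility: 1 if the action names the true state, 0 otherwise."""
    return 1.0 if a == theta else 0.0


def joint_weights(xs):
    """f(xs | theta) * pi(theta) for each theta (an unnormalised posterior)."""
    weights = {}
    for theta, p in prior.items():
        lik = 1.0
        for x in xs:
            lik *= p_one[theta] if x == 1 else 1.0 - p_one[theta]
        weights[theta] = lik * p
    return weights


def score(a, weights):
    """Expected utility of action a, times the marginal probability of the data."""
    return sum(utility(a, t) * w for t, w in weights.items())


def best(weights):
    return max(actions, key=lambda a: score(a, weights))


v_prior = score(best(prior), prior)

# E[V(pi_{x1})] and E[V(pi_{x1 x2})]; the marginal m(x) cancels the normalisation.
ev_1 = sum(score(best(w), w) for w in (joint_weights((x1,)) for x1 in (0, 1)))
ev_12 = sum(score(best(w), w)
            for w in (joint_weights(xs) for xs in product((0, 1), repeat=2)))

v_e1 = ev_1 - v_prior             # V(E_1)
v_e12 = ev_12 - v_prior           # V(E_12), equation (13.16)

# V(E_2 | E_1): gain over a_1^*, the action that was optimal after x1 alone.
v_e2_given_e1 = 0.0
for x1, x2 in product((0, 1), repeat=2):
    w12 = joint_weights((x1, x2))
    a1_star = best(joint_weights((x1,)))
    v_e2_given_e1 += score(best(w12), w12) - score(a1_star, w12)

print(v_e1, v_e2_given_e1, v_e12)
```

In this toy problem the second observation is worth much less than the first, even though the two observations are exchangeable, which is the point made in the closing remark above.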


13.1.4 The distribution of information 


The observed information $V_x$ depends on the observed sample and it is unknown
prior to experimentation. So far we focused on its expectation, but it can be useful to
study the distribution of $V_x$, both for model validation and design of experiments.

We illustrate this using an example based on normal data. Suppose that a random
quantity x is drawn from a $N(\theta, \sigma^2)$ distribution, and that the prior distribution for $\theta$
is $N(\mu_0, \tau_0^2)$. The posterior distribution of $\theta$ given x is a $N(\mu_x, \tau_x^2)$ distribution where,
as given in Appendix A.4,

$$\mu_x = \frac{\sigma^2}{\sigma^2 + \tau_0^2}\,\mu_0 + \frac{\tau_0^2}{\sigma^2 + \tau_0^2}\,x \quad \text{and} \quad \tau_x^2 = \frac{\sigma^2 \tau_0^2}{\sigma^2 + \tau_0^2}$$

and the marginal distribution of x is $N(\mu_0, \sigma^2 + \tau_0^2)$. Earlier we saw that with the
utility function $u(a(\theta)) = -(\theta - a)^2$, we have

$$V_x(\mathcal{E}) = (E[\theta|x] - E[\theta])^2.$$

Thus, in this example, $V_x(\mathcal{E}) = (\mu_x - \mu_0)^2$. The difference in means is

$$\mu_x - \mu_0 = \frac{\sigma^2}{\sigma^2 + \tau_0^2}\,\mu_0 + \frac{\tau_0^2}{\sigma^2 + \tau_0^2}\,x - \mu_0 = \frac{\tau_0^2}{\sigma^2 + \tau_0^2}\,(x - \mu_0).$$

Since $(x - \mu_0) \sim N(0, \sigma^2 + \tau_0^2)$, we have

$$\mu_x - \mu_0 \sim N\left(0, \frac{\tau_0^4}{\sigma^2 + \tau_0^2}\right).$$

Therefore,

$$\frac{\sigma^2 + \tau_0^2}{\tau_0^4}\,(\mu_x - \mu_0)^2 \sim \chi^2_1.$$


Knowing the distribution of information gives us a more detailed view of what we 
can anticipate to learn from an observation. Features of this distribution could be used 
to discriminate among experiments that have the same expected utility. Also, after 
we make the observation, comparing the observed information to the distribution 
expected a priori can provide an informal overall diagnostic of the joint specification 
of utility and probability model. Extreme outliers would lead to questioning whether 
the model we specified was appropriate for the experiment. 
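
A short simulation can make the distribution of information in the normal example tangible. The sketch below assumes hypothetical values for the prior mean, prior variance, and sampling variance; it draws $\theta$ and x from the model, computes the observed information as the squared shift in the mean, and rescales it so that it can be compared with a chi-square distribution with one degree of freedom (mean 1, and probability about 0.683 of falling below 1).

```python
import math
import random


def simulate_information(mu0, tau0_sq, sigma_sq, n_draws, seed=0):
    """Draw observed information (mu_x - mu0)^2 under the prior predictive."""
    rng = random.Random(seed)
    shrink = tau0_sq / (sigma_sq + tau0_sq)   # weight given to the observation
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(mu0, math.sqrt(tau0_sq))
        x = rng.gauss(theta, math.sqrt(sigma_sq))
        mu_x = mu0 + shrink * (x - mu0)       # posterior mean
        draws.append((mu_x - mu0) ** 2)
    return draws


# Hypothetical settings, chosen only for illustration.
mu0, tau0_sq, sigma_sq = 1.0, 4.0, 1.0
info = simulate_information(mu0, tau0_sq, sigma_sq, n_draws=100_000)

# Rescale so that the result should behave like a chi-square(1) variable.
scale = (sigma_sq + tau0_sq) / tau0_sq ** 2
scaled = [scale * v for v in info]

mean_scaled = sum(scaled) / len(scaled)
frac_below_one = sum(s <= 1.0 for s in scaled) / len(scaled)
print(mean_scaled, frac_below_one)
```

The same simulated draws could be used to read off tail probabilities for an observed value of the information, in the spirit of the informal diagnostic described above.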


13.2 Examples 

13.2.1 Tasting grapes 

This is a simple example taken almost verbatim from Savage (1954). While obviously
a bit contrived, it is useful to make things concrete, and give you numbers
to play around with. A decision maker is considering whether to buy some grapes
and, if so, in what quantity. To his or her taste, grapes can be classified as of poor,
fair, or excellent quality. The unknown $\theta$ represents the quality of the grapes and can
take values 1, 2, or 3, indicating increasing quality. The decision maker's personal
probabilities for the quality of the grapes are stated in Table 13.1.

The decision maker can buy 0, 1, 2, or 3 pounds of grapes. This defines the basic
or terminal acts in this example. The utilities of each act according to the quality of
the grapes are stated in Table 13.2. Buying 1 pound of grapes maximizes the expected
utility $U_\pi(a)$. Thus, $a^* = 1$ is the Bayes action with value $V(\pi) = 1$.


Table 13.1 Prior probabilities for the quality of the grapes.

Quality    Prior probability
1          1/4
2          1/2
3          1/4

Table 13.2 Utilities associated with each action and quality of the grapes.

            $\theta$
a        1      2      3      $U_\pi(a)$
0        0      0      0      0
1       -1      1      3      1
2       -3      0      5      1/2
3       -6     -2      6      -1

Table 13.3 Joint probabilities (multiplied by 128) of quality $\theta$ and outcome x.

            $\theta$
x        1      2      3
1       15      5      1
2       10     15      2
3        4     24      4
4        2     15     10
5        1      5     15

Table 13.4 Expected utility of action a (in pounds of grapes) given outcome x. For each x,
the highest expected utility is in italics.

                              x
a        1          2          3          4          5
0      *0/21*      0/27       0/32       0/27       0/21
1      -7/21     *11/27*    *32/32*     43/27      49/21
2     -40/21     -20/27      8/32     *44/27*     72/21
3     -94/21     -78/27    -48/32      18/27     *74/21*


Suppose the decision maker has the option of tasting some grapes. How much
should he or she pay for making this observation? Suppose that there are five possible
outcomes of observation x, with low values of x suggesting low quality. Table 13.3
shows the joint distribution of x and $\theta$.

Using Tables 13.2 and 13.3 it can be shown that the conditional expectation of
the utility of each action given each possible outcome is as given in Table 13.4. The
highest value of the expected utility $V(\pi_x)$, for each x, is shown in italics. Averaging
with respect to the marginal distribution of x from Table 13.3, $E_x[V(\pi_x)] = 161/128 \approx 1.26$.
The decision maker would pay, if necessary, up to $V(\mathcal{E}) = E_x[V(\pi_x)] - V(\pi) \approx 1.26 - 1.00 = 0.26$
utilities for tasting the grapes before buying.
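
The arithmetic of this example is easy to reproduce. The sketch below enters the utilities of Table 13.2 and the joint counts of Table 13.3, recovers the prior by summing the joint table, and recomputes the value of tasting with exact fractions. The variable names are illustrative choices.

```python
from fractions import Fraction

# Table 13.2: utilities u(a, theta) for theta = 1, 2, 3, indexed by pounds bought.
utilities = {0: (0, 0, 0), 1: (-1, 1, 3), 2: (-3, 0, 5), 3: (-6, -2, 6)}

# Table 13.3: joint probabilities of (x, theta) times 128, rows indexed by x.
joint_counts = {1: (15, 5, 1), 2: (10, 15, 2), 3: (4, 24, 4),
                4: (2, 15, 10), 5: (1, 5, 15)}
TOTAL = 128

# Prior on theta, recovered as the column sums of the joint table.
prior = [Fraction(sum(row[k] for row in joint_counts.values()), TOTAL)
         for k in range(3)]

# Prior expected utility of each action, and the value V(pi) of the best one.
prior_eu = {a: sum(u * p for u, p in zip(us, prior)) for a, us in utilities.items()}
v_prior = max(prior_eu.values())

# E_x[V(pi_x)]: for each x, take the best action under the joint weights.
# Summing max_a sum_theta u(a, theta) * P(x, theta) already averages over m(x).
best_action = {}
exp_post_value = Fraction(0)
for x, row in joint_counts.items():
    scores = {a: sum(Fraction(u * c, TOTAL) for u, c in zip(us, row))
              for a, us in utilities.items()}
    a_star = max(scores, key=scores.get)
    best_action[x] = a_star
    exp_post_value += scores[a_star]

value_of_tasting = exp_post_value - v_prior
print(best_action, exp_post_value, value_of_tasting)
```

The optimal purchase rises with the tasting outcome: nothing for x = 1, one pound for x = 2 or 3, two pounds for x = 4, and three pounds for x = 5.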

13.2.2 Medical testing 

The next example is a classic two-stage decision tree. We will use it to illustrate 
in a simple setting the value of information analysis, and to show how the value of 
information varies as a function of prior knowledge about $\theta$. Return to the travel
insurance example of Section 7.3. Remember you are about to take a trip overseas. 
You are not sure about the status of your vaccination against a certain mild disease 
that is common in the country you plan to visit, and need to decide whether to buy 
health insurance for the trip. We will assume that you will be exposed to the disease,
but you are uncertain about whether your present immunization will work. Based on
aggregate data on tourists like yourself, the chance of developing the disease during
the trip is about 3% overall. Treatment and hospital abroad would normally cost you,
say, 1000 dollars. There is also a definite loss in quality of life in going all the way to
a foreign country and being grounded at a local hospital instead of making the most
out of your experience, but we are going to ignore this aspect here. On the other
hand, if you buy a travel insurance plan, which you can do for 50 dollars, all your
expenses will be covered. This is a classical gamble versus sure outcome situation.

Table 13.5 Hypothetical monetary consequences of buying or not buying an insurance
plan for the trip to an exotic country.

Actions            Events
                   $\theta_1$: ill    $\theta_2$: not ill
Insurance          -50                -50
No insurance       -1000              0

Table 13.5 presents the outcomes for this decision problem. We are going to
analyze this decision problem assuming both a risk-neutral and a risk-averse utility
function for money. In the risk-neutral case we can simply look at the monetary outcome,
so when we talk about “utility” we refer to the risk-averse case. We will start
by working out the risk-neutral case and come back to the risk-averse case later.

If you are risk neutral, based on the overall rate of disease of 3% in tourists, you
should not buy the insurance plan because this action has an expected loss of 30
dollars, which is better than a loss of 50 dollars for the insurance plan. Actually, you
would still choose not to buy the insurance for any value of $\pi(\theta_1)$ less than 0.05, as
the expected utility (or expected return) of the best action is

$$V(\pi) = \begin{cases} -1000\,\pi(\theta_1) & \text{if } \pi(\theta_1) < 0.05 \\ -50 & \text{if } \pi(\theta_1) \ge 0.05. \end{cases} \tag{13.19}$$


How much money could you save if you knew exactly what will happen? That
amount is the value of perfect information. The optimal decision under perfect information
is to buy insurance if you know you will become ill and not to buy it if you
know you will stay healthy. In the notation of Section 13.1.2

$$a_\theta(\theta) = \begin{cases} \text{insurance} & \text{if } \theta = \theta_1 \\ \text{no insurance} & \text{if } \theta = \theta_2. \end{cases}$$

The returns that go with the two cases are

$$u(a_\theta(\theta)) = \begin{cases} -50 & \text{if } \theta = \theta_1 \\ 0 & \text{if } \theta = \theta_2 \end{cases}$$
so that the expected return under perfect information is $-50\,\pi(\theta_1)$. In our example,
$-50 \times 0.03 = -1.5$. The difference between this amount and the expected return
under current information is the expected value of perfect information. For a general
prior, this works out to be

$$V(\mathcal{E}^\infty) = -50\,\pi(\theta_1) - V(\pi) = \begin{cases} 950\,\pi(\theta_1) & \text{if } \pi(\theta_1) < 0.05 \\ 50\,(1 - \pi(\theta_1)) & \text{if } \pi(\theta_1) \ge 0.05. \end{cases} \tag{13.20}$$


An alternative way of thinking about the value of perfect information as a function
of the initial prior knowledge on $\theta$ is presented in Figure 13.3. The expected
utility of not buying the insurance plan is $-1000\,\pi(\theta_1)$, while the expected utility
of buying insurance is $-50$. The expected value of perfect information for a
given $\pi(\theta_1)$ is the difference between the expected return with perfect information
and the larger of the expected returns with and without insurance. The
expected value of perfect information increases for $\pi(\theta_1) < 0.05$ and it decreases for
$\pi(\theta_1) > 0.05$.

Figure 13.3 The expected utility of no insurance (solid line) and insurance (dashed
line) as a function of the prior $\pi(\theta_1)$. The dotted line, a weighted average of the two,
is the expected utility of perfect information, that is $E_\theta[u(a_\theta(\theta))]$. The gap between
the dotted line and the maximum of the lower two lines is the expected value of
perfect information. As a function of the prior, this is smallest at the extremes, where
we think we know a lot, and greatest at the point of indifference between the two
actions, where information is most valued.

When $\pi(\theta_1) = 0.03$, the expected value of perfect information is $950 \times 0.03 = 28.5$.
In other words, you would be willing to spend up to 28.5 dollars for an infallible
medical test that can predict exactly whether you are immune or not. In practice, it
is rare to have access to perfect information, but this calculation gives you an upper
bound for the value of the information given by a reliable, although not infallible,
test. We turn to this case now.
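
The expected value of perfect information for this example is simple enough to code directly. The sketch below uses the monetary outcomes of Table 13.5 in the risk-neutral case; the function names are illustrative.

```python
def value_current(p_ill, cost_ill=1000.0, premium=50.0):
    """V(pi) as in (13.19): best expected return with no further information."""
    return max(-cost_ill * p_ill, -premium)


def value_perfect(p_ill, cost_ill=1000.0, premium=50.0):
    """Expected return when theta is revealed before the action is chosen."""
    return p_ill * max(-cost_ill, -premium) + (1.0 - p_ill) * max(0.0, -premium)


def evpi(p_ill):
    """Expected value of perfect information, as in (13.20)."""
    return value_perfect(p_ill) - value_current(p_ill)


# Value at the prior used in the text, and the prior at which it peaks on a grid.
evpi_at_3_percent = evpi(0.03)
grid = [i / 1000 for i in range(1001)]
most_valuable_prior = max(grid, key=evpi)
print(evpi_at_3_percent, most_valuable_prior)
```

On the grid, the value peaks at the indifference point 0.05, in line with the description of Figure 13.3.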

Suppose you have the option of undergoing a medical test that can inform you
about whether your immunization is likely to work. After the test, your individual
chances of illness will be different from the overall 3%. This test costs c dollars, so
from what we know about the value of perfect information, c has to be less than or
equal to 28.5 for you to consider this possibility. The result x of the test will be either
1 (for high risk or bad immunization) or 0 (for low risk or good immunization). From
past experience, the test is correct in 90% of the subjects with the bad immunization
and 77% of subjects with good immunization. In medical terms, these figures represent
the sensitivity and specificity of the test (Lusted 1968). Formally they translate
into $f(x = 1|\theta_1) = 0.9$ and $f(x = 1|\theta_2) = 0.23$. Figure 13.4 shows the solution of this
decision problem using a decision tree, assuming that c = 10. The optimal strategy
is to buy the test. If x = 1, then the best strategy is buying the insurance. Otherwise,
it is best not to buy it. If no test was made, then not buying the insurance would be
best, as observed earlier.

Figure 13.4 Solved decision tree for the two-stage insurance problem.

The test result x can alter your course of action and therefore it provides
valuable information. The expected value of the information in x is defined by equation
(13.12). Not considering cost for the moment, if x = 1 the optimal decision is
to buy the insurance and that has conditional expected utility of $-50$; if x = 0 the
optimal decision is not to buy insurance, which has conditional expected utility of
$-4$. So in our example the first term in expression (13.12) is

$$E_x[V(\pi_x)] = -50 \times m(x = 1) - 4 \times m(x = 0) = -50 \times 0.25 - 4 \times 0.75 = -15.5,$$

and thus the expected value of the information provided by the test is
$V(\mathcal{E}) = -15.5 - (-30) = 14.5$. This difference exceeds the cost, which is 10 dollars, which
is why according to the tree it is optimal to buy the test.
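
The same numbers can be obtained without rounding by applying Bayes' rule for each test result. The sketch below uses the sensitivity, false positive rate, and monetary outcomes stated in the text; the function names are illustrative.

```python
SENS = 0.9          # f(x = 1 | theta_1), from the text
FALSE_POS = 0.23    # f(x = 1 | theta_2), from the text


def value(p_ill):
    """Risk-neutral V(pi): the better of no insurance and insurance."""
    return max(-1000.0 * p_ill, -50.0)


def posterior(p_ill, x):
    """Return (pi(theta_1 | x), m(x)) for a test result x in {0, 1}."""
    like_ill = SENS if x == 1 else 1.0 - SENS
    like_ok = FALSE_POS if x == 1 else 1.0 - FALSE_POS
    m = like_ill * p_ill + like_ok * (1.0 - p_ill)
    return like_ill * p_ill / m, m


def evsi(p_ill):
    """Expected value of the test, equation (13.12)."""
    total = 0.0
    for x in (0, 1):
        post, m = posterior(p_ill, x)
        total += m * value(post)
    return total - value(p_ill)


post_positive, m_positive = posterior(0.03, 1)
value_of_test = evsi(0.03)
net_gain = value_of_test - 10.0     # after paying c = 10 for the test
print(post_positive, m_positive, value_of_test, net_gain)
```

A positive result raises the probability of illness from 3% to roughly 11%, which is above the 5% indifference point and therefore changes the decision.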

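We now explore, as we had done in Figure 13.3, how the value of information
changes as a function of the prior information on $\theta$. We begin by evaluating $V(\pi_x)$
for each of the two possible experimental outcomes. From (13.19)

$$V(\pi_{x=1}) = \begin{cases} -1000\,\pi(\theta_1|x = 1) & \text{if } \pi(\theta_1|x = 1) < 0.05 \\ -50 & \text{if } \pi(\theta_1|x = 1) \ge 0.05 \end{cases}$$

and

$$V(\pi_{x=0}) = \begin{cases} -1000\,\pi(\theta_1|x = 0) & \text{if } \pi(\theta_1|x = 0) < 0.05 \\ -50 & \text{if } \pi(\theta_1|x = 0) \ge 0.05. \end{cases}$$

Replacing Bayes' formula for $\pi(\theta_1|x)$ and solving the inequalities, we can rewrite
these as

$$V(\pi_{x=1}) = \begin{cases} -1000\,\pi(\theta_1|x = 1) & \text{if } \pi(\theta_1) < 0.013 \\ -50 & \text{if } \pi(\theta_1) \ge 0.013 \end{cases}$$

and

$$V(\pi_{x=0}) = \begin{cases} -1000\,\pi(\theta_1|x = 0) & \text{if } \pi(\theta_1) < 0.288 \\ -50 & \text{if } \pi(\theta_1) \ge 0.288. \end{cases}$$

Averaging these two functions with respect to the marginal distribution of x we get

$$E_x[V(\pi_x)] = \begin{cases} -1000\,\pi(\theta_1) & \text{if } \pi(\theta_1) < 0.013 \\ -11.5 - 133.5\,\pi(\theta_1) & \text{if } 0.013 \le \pi(\theta_1) < 0.288 \\ -50 & \text{if } \pi(\theta_1) \ge 0.288. \end{cases} \tag{13.21}$$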
The intermediate range in this expression is the set of priors such that the optimal
solution can be altered by the results of the test. By contrast, values outside this
interval are too far from the indifference point $\pi(\theta_1) = 0.05$ for the test to make a
difference: the two results of the test lead to posterior probabilities of illness that are
both on the same side of 0.05, and the best decision is the same.

Subtracting (13.19) from (13.21) gives the expected value of information

$$V(\mathcal{E}) = E_x[V(\pi_x)] - V(\pi) = \begin{cases} 0 & \text{if } \pi(\theta_1) < 0.013 \\ -11.5 + 866.5\,\pi(\theta_1) & \text{if } 0.013 \le \pi(\theta_1) < 0.05 \\ 38.5 - 133.5\,\pi(\theta_1) & \text{if } 0.05 \le \pi(\theta_1) < 0.288 \\ 0 & \text{if } \pi(\theta_1) \ge 0.288. \end{cases} \tag{13.22}$$

Figure 13.5 graphs $V(\mathcal{E})$ and $V(\mathcal{E}^\infty)$ versus the prior $\pi(\theta_1)$. At the extremes, evidence
from the test is not sufficient to change a strong initial opinion about becoming or
not becoming ill, and the test is not expected to contribute valuable information. As
was the case with perfect information, the value of the test is largest at the point of
indifference $\pi(\theta_1) = 0.05$.
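
The piecewise expression for the value of the test can be checked against a direct Bayesian calculation over a grid of priors. The sketch below is one way to do this under the risk-neutral assumptions of the example. The helper `prior_cutoff`, an illustrative name, finds the prior at which a given test result moves the posterior exactly to the indifference point, working on the odds scale.

```python
SENS, FALSE_POS = 0.9, 0.23           # f(x=1|theta_1) and f(x=1|theta_2) from the text
COST_ILL, PREMIUM = 1000.0, 50.0
POST_CUTOFF = PREMIUM / COST_ILL      # indifference point, 0.05


def value(p_ill):
    return max(-COST_ILL * p_ill, -PREMIUM)


def evsi_direct(p_ill):
    """E_x[V(pi_x)] - V(pi), computed by Bayes' rule for each test result."""
    total = 0.0
    for like_ill, like_ok in ((SENS, FALSE_POS), (1 - SENS, 1 - FALSE_POS)):
        m = like_ill * p_ill + like_ok * (1 - p_ill)
        if m > 0:
            total += m * value(like_ill * p_ill / m)
    return total - value(p_ill)


def prior_cutoff(like_ill, like_ok):
    """Prior probability at which the posterior equals POST_CUTOFF."""
    odds = POST_CUTOFF / (1 - POST_CUTOFF) * like_ok / like_ill
    return odds / (1 + odds)


low = prior_cutoff(SENS, FALSE_POS)              # about 0.013
high = prior_cutoff(1 - SENS, 1 - FALSE_POS)     # about 0.288


def evsi_piecewise(p_ill):
    """The four branches of (13.22), using the unrounded cutoffs."""
    if p_ill < low or p_ill > high:
        return 0.0
    if p_ill < POST_CUTOFF:
        return -11.5 + 866.5 * p_ill
    return 38.5 - 133.5 * p_ill


grid = [i / 1000 for i in range(1001)]
max_gap = max(abs(evsi_direct(p) - evsi_piecewise(p)) for p in grid)
print(low, high, max_gap)
```

The two computations agree to within floating point error, and the cutoffs reproduce the rounded values 0.013 and 0.288.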

Figure 13.5 Expected value of information versus prior probability of illness. The
thin line is the expected value of perfect information, and corresponds to the vertical
gap between the dotted line and the maximum of the lower two lines in Figure 13.3.
The thicker line is the expected value of the information provided by the medical test.

As a final exploration of this example, let us imagine that you are slightly averse
to risk, as opposed to risk neutral. Specifically, your utility function for a change in
wealth of z dollars is of the form

$$u(z) = a - b\,e^{-\lambda z}, \quad b, \lambda > 0. \tag{13.23}$$

In the terminology of Chapter 4 this is a constantly risk-averse utility. To elicit your
utility you consider a lottery where you win 1010 dollars with probability 0.5, and nothing
otherwise, and believe that you are indifferent between 450 dollars and the lottery ticket.
Then you set u(0) = 0, u(1010) = 1, and u(450) = 0.5. Solving for the parameters
of your utility function gives a = b = 2.8137218 and $\lambda$ = 0.000434779. Using
equation (13.23) we can extrapolate the utilities to the negative monetary outcomes
in the decision tree of Figure 13.4. Those are

Outcomes in dollars     -10       -50       -60       -1000     -1010
Utility                 -0.012    -0.062    -0.074    -1.532    -1.551

Solving the tree again, the best decision in the absence of testing continues to be not
to buy the insurance plan. In the original case, not buying the insurance plan was the
best decision as long as $\pi(\theta_1)$ was less than 0.05. Now such a cutoff has diminished
to about 0.04 (give or take rounding error) as a result of your aversion to the risk of
spending a large sum for the hospital bill. Finally, the overall optimal strategy is still
to buy the test and buy insurance only if the test is positive.
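
The elicitation step amounts to solving one nonlinear equation. With u(0) = 0 we get a = b, and the two remaining conditions reduce to a single equation in $\lambda$. The sketch below solves it by bisection, which is an illustrative choice of solver rather than the one used for the text, and then reproduces the table of utilities and the risk-averse cutoff.

```python
import math


def ratio_gap(lam):
    """u(450) / u(1010) - 0.5 when a = b, which must vanish at the solution."""
    return (1 - math.exp(-450 * lam)) / (1 - math.exp(-1010 * lam)) - 0.5


# Bisection: the ratio increases with lam, from 450/1010 towards 1.
lo, hi = 1e-8, 1e-2
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if ratio_gap(mid) < 0:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

# u(1010) = 1 then fixes the common value of a and b.
a = b = 1 / (1 - math.exp(-1010 * lam))


def u(z):
    """Constantly risk-averse utility (13.23)."""
    return a - b * math.exp(-lam * z)


utilities = {z: round(u(z), 3) for z in (-10, -50, -60, -1000, -1010)}

# Without the test, not insuring is best while p * u(-1000) > u(-50).
risk_averse_cutoff = u(-50) / u(-1000)
print(lam, a, utilities, risk_averse_cutoff)
```

The cutoff comes out slightly above 0.04, which matches the "about 0.04" quoted above.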

The story about the trip abroad is fictional, but it captures the essential features
of medical diagnosis and evaluation of the utility of a medical test. Real applications
are common in the medical decision-making literature (Lusted 1968, Barry et al.
1986, Sox 1996, Parmigiani 2002). The approach we just described for quantification
of the value of information in medical diagnosis relies on knowing probabilities
such as $\pi(\theta)$ and $f(x|\theta)$ with certainty. In reality these will be estimates and are also
uncertain. While for a utility-maximizing decision maker it is appropriate to average
out this uncertainty, there are broader uses of a value of information analysis,
such as supporting decision making by a range of patients and physicians who are
not prepared to solve a decision tree every time they order a test, or to support regulatory
decisions. With this in mind, it is interesting to represent uncertainty about
model inputs and explore the implication of such uncertainty on the conclusions.
One approach to doing this is probabilistic sensitivity analysis (Doubilet et al. 1985,
Critchfield and Willard 1986), which consists of drawing a random set of inputs
from a distribution that reflects the uncertainty about them (for example, is consistent
with confidence intervals reported in the studies that were used to determine
those quantities), and evaluating the value of information for each.
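
As a rough illustration of probabilistic sensitivity analysis, the sketch below applies the idea to the travel insurance example rather than to a real clinical model. The Beta distributions for the prevalence, sensitivity, and specificity are hypothetical stand-ins for the uncertainty that published studies would report; they are centered near the values used in the text but are not taken from any source.

```python
import random
import statistics

COST_ILL, PREMIUM = 1000.0, 50.0


def value(p_ill):
    """Risk-neutral value of the decision problem at probability of illness p_ill."""
    return max(-COST_ILL * p_ill, -PREMIUM)


def evsi(p_ill, sens, spec):
    """Expected value of the test for one set of inputs."""
    total = 0.0
    for like_ill, like_ok in ((sens, 1 - spec), (1 - sens, spec)):
        m = like_ill * p_ill + like_ok * (1 - p_ill)
        if m > 0:
            total += m * value(like_ill * p_ill / m)
    return total - value(p_ill)


rng = random.Random(2024)
draws = []
for _ in range(5000):
    # Hypothetical input uncertainty: Beta draws with means 0.03, 0.90 and 0.77.
    p_ill = rng.betavariate(3, 97)
    sens = rng.betavariate(90, 10)
    spec = rng.betavariate(77, 23)
    draws.append(evsi(p_ill, sens, spec))

draws.sort()
summary = {
    "mean": statistics.fmean(draws),
    "q05": draws[int(0.05 * len(draws))],
    "q95": draws[int(0.95 * len(draws))],
}
print(summary)
```

Each draw is itself a nonnegative expected value of information, as Theorem 13.2 requires; the spread across draws reflects only the assumed uncertainty about the inputs.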

Figure 13.6 shows the results of probabilistic sensitivity analysis applied to a
decision analysis of the value of genetic testing for a cancer-causing gene called
BRCA1 in guiding preventative surgery decisions. Details of the underlying modeling
assumptions are discussed by Parmigiani et al. (1998). Figure 13.6 reminds us of
Figure 13.5 with additional noise. There is little variability in the range of priors for
which the test has value, while there is more variability in the value of information
to the left of the mode. As was the case in our simpler example above, the portions
of the curve on either side of the mode are affected by different input values.

Figure 13.6 Results of a probabilistic sensitivity analysis of the additional expected
quality adjusted life resulting from genetic testing for the cancer-causing gene
BRCA1. Figure from Parmigiani et al. (1998).

In Figure 13.6 we show variability in the value of information as inputs vary. In
general, uncertainty about inputs translates into uncertainty about optimal decisions
at all stages of a multistage problem. Because of this, in a probabilistic sensitivity
analysis of the value of diagnostic information it can be interesting to separately
quantify and convey the uncertainty deriving directly from randomness in the input,
and the uncertainty deriving from randomness in the optimal future decisions. In
clinical settings, this distinction can become critical, because it has implications for
the appropriateness of a treatment. A detailed discussion of this issue is in Parmigiani
(2004).


13.2.3 Hypothesis testing 

We now turn to a more statistical application of value of information analysis, and
work out in detail an example of observed information in hypothesis testing. Assume
that $x_1, \ldots, x_9$ form a random sample from a $N(\theta, 1)$ distribution. Let the prior for $\theta$
be $N(0, 4)$ and the goal be that of testing the null hypothesis $H_0: \theta \le 0$ versus the
alternative hypothesis $H_1: \theta > 0$. Set $\pi(H_0) = \pi(H_1) = 1/2$. Let $a_i$ denote the
action of accepting hypothesis $H_i$, $i = 0, 1$, and assume that the utility function is
that shown in Table 13.6.

Table 13.6 Utility function for the hypothesis testing example.

Actions                   States of nature
                          $H_0$      $H_1$
$a_0$ = accept $H_0$      0          -1
$a_1$ = accept $H_1$      -3         0

The expected utilities for each decision are:

• for decision $a_0$:

$$U_\pi(a_0) = u(a_0(H_0))\,\pi(H_0) + u(a_0(H_1))\,\pi(H_1) = 0\,\pi(H_0) - 1\,\pi(H_1) = -1/2,$$

• for decision $a_1$:

$$U_\pi(a_1) = u(a_1(H_0))\,\pi(H_0) + u(a_1(H_1))\,\pi(H_1) = -3\,\pi(H_0) + 0\,\pi(H_1) = -3/2,$$

with $V(\pi) = \sup_{a \in \mathcal{A}} U_\pi(a) = -1/2$. Based on the initial prior, the best decision is
action $a_0$. The two hypotheses are a priori equally likely, but the consequences of a
wrong decision are worse for $a_1$ than they are for $a_0$.

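Using standard conjugate prior algebra, we can derive the posterior distribution
of $\theta$, which is

$$\pi_x(\theta) = N\left(\frac{36}{37}\,\bar{x}, \frac{4}{37}\right)$$

which in turn can be used to compute the posterior probabilities of $H_0$ and $H_1$ as

$$\pi(H_0|x^n) = \pi(\theta \le 0|x^n) = \Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right)$$

$$\pi(H_1|x^n) = \pi(\theta > 0|x^n) = 1 - \Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
After observing $x^n = (x_1, \ldots, x_9)$, the expected utilities of the two alternative
decisions are:

• for decision $a_0$:

$$U_{\pi_x}(a_0) = u(a_0(H_0))\,\pi_x(H_0) + u(a_0(H_1))\,\pi_x(H_1) = 0\,\pi_x(H_0) - 1\,\pi_x(H_1) = \Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right) - 1,$$

• for decision $a_1$:

$$U_{\pi_x}(a_1) = u(a_1(H_0))\,\pi_x(H_0) + u(a_1(H_1))\,\pi_x(H_1) = -3\,\pi_x(H_0) + 0\,\pi_x(H_1) = -3\,\Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right),$$

and $a_0$ is optimal if $U_{\pi_x}(a_0) > U_{\pi_x}(a_1)$, that is if

$$\Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right) - 1 > -3\,\Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right)$$

or, equivalently, if $\bar{x} < 0.2279$. Therefore,

$$V(\pi_x) = \max\{U_{\pi_x}(a_0), U_{\pi_x}(a_1)\} = \begin{cases} U_{\pi_x}(a_0) = \Phi(-18\bar{x}/\sqrt{37}) - 1 & \text{if } \bar{x} < 0.2279 \\ U_{\pi_x}(a_1) = -3\,\Phi(-18\bar{x}/\sqrt{37}) & \text{if } \bar{x} \ge 0.2279. \end{cases}$$

Also,

$$U_{\pi_x}(a^*) = U_{\pi_x}(a_0) = \pi_x(H_0)\,u(a_0(H_0)) + \pi_x(H_1)\,u(a_0(H_1)) = 0\,\pi_x(H_0) - 1\,\pi_x(H_1) = -\pi_x(H_1) = -1 + \Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right).$$

Therefore, the observed information is

$$V_x(\mathcal{E}) = V(\pi_x) - U_{\pi_x}(a_0) = \begin{cases} 0 & \text{if } \bar{x} < 0.2279 \\ 1 - 4\,\Phi(-18\bar{x}/\sqrt{37}) & \text{if } \bar{x} \ge 0.2279. \end{cases}$$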


In words, small values of $\bar{x}$ result in the same decision that would have been made
a priori, and do not contribute to decision making. Values larger than 0.2279 lead
to a reversal of the decision and are informative. For very large values the difference
in posterior expected utility of the two decisions approaches one. The resulting
observed information is shown in Figure 13.7.

Figure 13.7 Observed information for the hypothesis testing example, as a function
of the sample mean.

Finally, we can now evaluate the expected value of information. Using the fact
that $\bar{x} \sim N(0, 37/9)$, with density $m(\bar{x})$, we have

$$V(\mathcal{E}) = E_x[V_x(\mathcal{E})] = \int_{0.2279}^{\infty} \left[1 - 4\,\Phi\left(\frac{-18\bar{x}}{\sqrt{37}}\right)\right] m(\bar{x})\,d\bar{x} = 0.416.$$

In expectation, the experiment is worth 0.416. Assuming a linear utility and constant
cost for the observations, this implies that the decision maker would be interested in
the experiment as long as observations had unit price lower than 0.046.
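
The expected value quoted above can be reproduced by one-dimensional numerical integration of the observed information against the marginal density of the sample mean. The sketch below uses the trapezoid rule on a truncated range. The integration scheme, the truncation at twelve marginal standard deviations, and the step count are illustrative choices, not those used for the text.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
k = 18 / math.sqrt(37)                      # posterior mean over posterior sd, per unit of xbar
cutoff = std_normal.inv_cdf(0.75) / k       # xbar above which a_1 becomes optimal
marginal = NormalDist(mu=0.0, sigma=math.sqrt(37 / 9))


def observed_information(xbar):
    """V_x(E): zero below the cutoff, 1 - 4 Phi(-18 xbar / sqrt(37)) above it."""
    if xbar < cutoff:
        return 0.0
    return 1.0 - 4.0 * std_normal.cdf(-k * xbar)


def expected_information(n_steps=20000):
    """Trapezoid rule for the integral of V_x(E) m(xbar) over [cutoff, upper]."""
    upper = cutoff + 12 * marginal.stdev    # mass beyond this point is negligible
    h = (upper - cutoff) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        x = cutoff + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * observed_information(x) * marginal.pdf(x)
    return total * h


v_e = expected_information()
price_per_observation = v_e / 9
print(cutoff, v_e, price_per_observation)
```

Dividing by the nine observations gives the break-even unit price mentioned in the text.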


13.3 Lindley information 

13.3.1 Definition 


In this section, we consider in more detail the decision problem of Chapter 10 and
example 3, that is, reporting a distribution regarding an unknown quantity $\theta$ with values
in $\Theta$. This can be thought of as a Bayesian way of modeling a situation in which
data are not gathered to solve a specific decision problem but rather to learn about 
the world or to provide several decision makers with information that can be useful 
in multiple decision problems. This information is embodied in the distribution that 
is reported. 
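As seen in our examples in Sections 13.1.1 and 13.1.3, if a denotes the reported
distribution and the utility function is $u(a(\theta)) = \log a(\theta)$, the Bayes decision is
$a^* = \pi$, where $\pi$ is the current distribution on $\theta$. The value associated with the
decision problem is

$$V(\pi) = \int_\Theta \log(\pi(\theta))\,\pi(\theta)\,d\theta. \tag{13.24}$$

Given the outcome of experiment $\mathcal{E}$, consisting of observing x from the distribution
$f(x|\theta)$, we can use the Bayes rule to determine $\pi_x$. By the argument above, the
value associated with the decision problem after observing x is

$$V(\pi_x) = \int_\Theta \log(\pi_x(\theta))\,\pi_x(\theta)\,d\theta, \tag{13.25}$$

and the observed information is

$$\begin{aligned}
V_x(\mathcal{E}) &= V(\pi_x) - U_{\pi_x}(a^*) \\
&= \int_\Theta \log(\pi_x(\theta))\,\pi_x(\theta)\,d\theta - \int_\Theta \log(\pi(\theta))\,\pi_x(\theta)\,d\theta \\
&= \int_\Theta \pi_x(\theta) \log\frac{\pi_x(\theta)}{\pi(\theta)}\,d\theta = \mathcal{I}_x(\mathcal{E}). && \text{(13.26)}
\end{aligned}$$

Prior to experimentation, the expected information from $\mathcal{E}$ is, using (13.12),

$$\begin{aligned}
V(\mathcal{E}) &= E_x[V(\pi_x)] - V(\pi) \\
&= \int_{\mathcal{X}} \int_\Theta \log\left(\frac{\pi_x(\theta)}{\pi(\theta)}\right) \pi_x(\theta)\,m(x)\,d\theta\,dx && \text{(13.27)} \\
&= \int_\Theta \int_{\mathcal{X}} \log\left(\frac{f(x|\theta)}{m(x)}\right) f(x|\theta)\,\pi(\theta)\,dx\,d\theta && \text{(13.28)} \\
&= E_\theta\left[E_{x|\theta}\left[\log\frac{f(x|\theta)}{m(x)}\right]\right] && \text{(13.29)} \\
&= E_x E_{\theta|x}\left[\log\frac{\pi_x(\theta)}{\pi(\theta)}\right] = \mathcal{I}(\mathcal{E}). && \text{(13.30)}
\end{aligned}$$

This is called Lindley information, and was introduced in Lindley (1956). We use
the notation $\mathcal{I}(\mathcal{E})$ for consistency with his paper. Lindley information was originally
derived from information-theoretic, rather than decision-theoretic, arguments, and
coincides with the expected value with respect to x of the Kullback-Leibler (KL)
divergence between $\pi_x$ and $\pi$, which is given by expression (13.26).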

In general, if P and Q are two probability measures defined on the same space,
with p and q their densities with respect to the common measure $\nu$, the KL divergence
between P and Q is defined as

$$KL(P, Q) = \int_{\mathcal{X}} \log\frac{p(x)}{q(x)}\,p(x)\,d\nu(x)$$

(see Kullback and Leibler 1951, Kullback 1959). The KL divergence is nonnegative
but it is not symmetric. In decision-theoretic terms, the reason is that it quantifies the
loss of representing uncertainty by Q when the truth is P, and that may differ from
the loss of representing uncertainty by P when the truth is Q. For further details on
properties of the KL divergence see Cover and Thomas (1991) or Schervish (1995).
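
A small discrete example shows the asymmetry. The two distributions below are hypothetical and serve only to illustrate the definition, with the counting measure playing the role of $\nu$.

```python
import math


def kl_divergence(p, q):
    """KL(P, Q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions.

    Terms with p(x) = 0 contribute zero; q is assumed positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions on three points.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)   # loss from reporting Q when the truth is P
kl_qp = kl_divergence(q, p)   # loss from reporting P when the truth is Q
print(kl_pq, kl_qp)
```

Both divergences are positive, but they differ, and the divergence of a distribution from itself is zero.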

13.3.2 Properties 

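Suppose that the experiment $\mathcal{E}_{12}$ consists of observing $x_1$ and $x_2$, and that the partial
experiments of observing just $x_1$ and just $x_2$ are denoted by $\mathcal{E}_1$ and $\mathcal{E}_2$, respectively.
Observing $x_2$ after $x_1$ has previously been observed has information
$\mathcal{I}_{x_1x_2}(\mathcal{E}_2|\mathcal{E}_1)$. As discussed in Section 13.1.3, the expected information of $\mathcal{E}_2$ after $\mathcal{E}_1$
has been performed is the average of $\mathcal{I}_{x_1x_2}(\mathcal{E}_2|\mathcal{E}_1)$ with respect to $x_1$ and $x_2$ and it is
denoted by $\mathcal{I}(\mathcal{E}_2|\mathcal{E}_1)$. We learned from Theorem 13.3 that the overall information in
$\mathcal{E}_{12}$ can be decomposed into the sum of the information contained in the first experiment
and the additional information gained by the second one given the first. A
special case using Lindley information is given by the next theorem. It is interesting
to revisit it because here the proof clarifies how this decomposition can be related
directly to the decomposition of a joint probability into a marginal and a conditional.

Theorem 13.4

$$\mathcal{I}(\mathcal{E}_{12}) = \mathcal{I}(\mathcal{E}_1) + \mathcal{I}(\mathcal{E}_2|\mathcal{E}_1). \tag{13.31}$$

Proof: From the definition of Lindley information we have

$$\mathcal{I}(\mathcal{E}_1) = E_{x_1} E_{\theta|x_1}\left[\log\frac{f(x_1|\theta)}{m(x_1)}\right] = E_{x_1} E_{x_2|x_1} E_{\theta|x_1,x_2}\left[\log\frac{f(x_1|\theta)}{m(x_1)}\right]. \tag{13.32}$$

Also, from the definition of the information provided by $\mathcal{E}_2$ after $\mathcal{E}_1$ has been
performed, we have

$$\mathcal{I}(\mathcal{E}_2|\mathcal{E}_1) = E_{x_1x_2}[\mathcal{I}_{x_1x_2}(\mathcal{E}_2|\mathcal{E}_1)] = E_{x_1} E_{x_2|x_1} E_{\theta|x_1,x_2}\left[\log\frac{f(x_2|\theta, x_1)}{m(x_2|x_1)}\right]. \tag{13.33}$$

Adding expressions (13.32) and (13.33) we obtain

$$\mathcal{I}(\mathcal{E}_1) + \mathcal{I}(\mathcal{E}_2|\mathcal{E}_1) = E_{x_1} E_{x_2|x_1} E_{\theta|x_1,x_2}\left[\log\frac{f(x_1|\theta)\,f(x_2|\theta, x_1)}{m(x_1)\,m(x_2|x_1)}\right] = \mathcal{I}(\mathcal{E}_{12})$$

which is the desired result.

□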
The results that follow are given without proof. Some of the proofs are good
exercises. The first result guarantees that the information measure for any experiment
is the same as the information for the corresponding sufficient statistic.

Theorem 13.5 Let $\mathcal{E}_1$ be the experiment consisting of the observation of x, and
let $\mathcal{E}_2$ be the experiment consisting of the observation of t(x), where t is a sufficient
statistic. Then $\mathcal{I}(\mathcal{E}_1) = \mathcal{I}(\mathcal{E}_2)$.

The next two results consider the case in which $x_1$ and $x_2$ are independent,
conditional on $\theta$, that is $f(x_1, x_2|\theta) = f(x_1|\theta)\,f(x_2|\theta)$.

Theorem 13.6 If $x_1$ and $x_2$ are conditionally independent given $\theta$, then

$$\mathcal{I}(\mathcal{E}_2|\mathcal{E}_1) \le \mathcal{I}(\mathcal{E}_2), \tag{13.34}$$

with equality if, and only if, $x_1$ and $x_2$ are unconditionally independent.

This proposition says that the information provided by the second of two conditionally
independent observations is, on average, smaller than that provided by the first.
This naturally reflects the diminishing returns of scientific experimentation.

Theorem 13.7 If $x_1$ and $x_2$ are independent, conditional on $\theta$, then

$$\mathcal{I}(\mathcal{E}_1) + \mathcal{I}(\mathcal{E}_2) \ge \mathcal{I}(\mathcal{E}_{12}) \tag{13.35}$$

with equality if, and only if, $x_1$ and $x_2$ are unconditionally independent.

In the case of identical experiments being repeated sequentially, we have a more
general result along the lines of the last proposition. Let $\mathcal{E}_1$ be any experiment and
let $\mathcal{E}_2, \mathcal{E}_3, \ldots$ be conditionally independent and identical repetitions of $\mathcal{E}_1$. Let also

$$\mathcal{E}^{(n)} = (\mathcal{E}_1, \ldots, \mathcal{E}_n). \tag{13.36}$$

Theorem 13.8 $\mathcal{I}(\mathcal{E}^{(n)})$ is a concave, increasing function of n.

Proof: We need to prove that

$$0 \le \mathcal{I}(\mathcal{E}^{(n+1)}) - \mathcal{I}(\mathcal{E}^{(n)}) \le \mathcal{I}(\mathcal{E}^{(n)}) - \mathcal{I}(\mathcal{E}^{(n-1)}). \tag{13.37}$$

By applying equation (13.31) we have $\mathcal{I}(\mathcal{E}^{(n+1)}) = \mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n)}) + \mathcal{I}(\mathcal{E}^{(n)})$.
Next, using the fact that Lindley information is nonnegative, we obtain
$\mathcal{I}(\mathcal{E}^{(n+1)}) - \mathcal{I}(\mathcal{E}^{(n)}) \ge 0$, which proves the left side inequality.

Using again equation (13.31) on both $\mathcal{I}(\mathcal{E}^{(n+1)})$ and $\mathcal{I}(\mathcal{E}^{(n)})$ we can observe that
the right side inequality is equivalent to $\mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n)}) \le \mathcal{I}(\mathcal{E}_n|\mathcal{E}^{(n-1)})$. As $\mathcal{E}_{n+1} = \mathcal{E}_n$,
the inequality becomes $\mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n)}) \le \mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n-1)})$. Also, since $\mathcal{E}^{(n)} = (\mathcal{E}^{(n-1)}, \mathcal{E}_n)$
we can rewrite the inequality as $\mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n-1)}, \mathcal{E}_n) \le \mathcal{I}(\mathcal{E}_{n+1}|\mathcal{E}^{(n-1)})$. The proof of this
inequality follows from an argument similar to that of (13.31). □
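
For a concrete view of Theorem 13.8, consider the conjugate normal model with known sampling variance. There the posterior variance does not depend on the data, and the Lindley information of n observations has the closed form $\frac{1}{2}\log(1 + n\tau_0^2/\sigma^2)$. This closed form is a standard result that is not derived in the text. The sketch below evaluates it for hypothetical variances and checks that the increments are positive and shrinking.

```python
import math


def lindley_normal(n, tau0_sq, sigma_sq):
    """Lindley information of n observations N(theta, sigma_sq), prior variance tau0_sq."""
    return 0.5 * math.log(1.0 + n * tau0_sq / sigma_sq)


# Hypothetical prior and sampling variances.
tau0_sq, sigma_sq = 4.0, 1.0

info = [lindley_normal(n, tau0_sq, sigma_sq) for n in range(0, 11)]
increments = [b - a for a, b in zip(info, info[1:])]

increasing = all(d > 0 for d in increments)
concave = all(later <= earlier for earlier, later in zip(increments, increments[1:]))
print(increments[:3], increasing, concave)
```

Each increment is the conditional information of one more observation given the earlier ones, so the shrinking increments are also an instance of the diminishing returns described after Theorem 13.6.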
As the value of V is convex in the prior, increased uncertainty means decreased
value of stopping before experimenting. We would then expect that increased
uncertainty means increased value of information. This is in fact the case.

Theorem 13.9 For fixed $\mathcal{E}$, $\mathcal{I}(\mathcal{E})$ is a concave function of $\pi(\theta)$.

This means that if $\pi_1$ and $\pi_2$ are two prior distributions for $\theta$, and $0 \le \alpha \le 1$,
then the information provided by an experiment $\mathcal{E}$ with a mixture prior
$\pi_0(\theta) = \alpha\,\pi_1(\theta) + (1 - \alpha)\,\pi_2(\theta)$ is at least as large as the linear combination of the information
provided by the experiment under priors $\pi_1$ and $\pi_2$. In symbols,
$\mathcal{I}_0(\mathcal{E}) \ge \alpha\,\mathcal{I}_1(\mathcal{E}) + (1 - \alpha)\,\mathcal{I}_2(\mathcal{E})$, where $\mathcal{I}_i(\mathcal{E})$ is the expected information provided by $\mathcal{E}$ under
prior $\pi_i$ for i = 0, 1, 2.

13.3.3 Computing 

In realistic applications closed form expressions for $\mathcal{I}$ are hard to derive. Numerical 
evaluation of expected information requires Monte Carlo integration (see, for 
instance, Gelman et al. 1995). The idea is as follows. First, simulate points $\theta_i$ from 
$\pi(\theta)$ and, for each, $x_i$ from $f(x \mid \theta_i)$, with $i = 1, \ldots, I$. Using the simulated sample, 
obtain the Monte Carlo estimate for $E_\theta E_{x \mid \theta}[\log f(x \mid \theta)]$, that is 

$$\frac{1}{I} \sum_{i=1}^{I} \log f(x_i \mid \theta_i). \tag{13.38}$$

Next, evaluate $E_x[\log(m(x))]$, again using Monte Carlo integration, by calculating 

$$\frac{1}{I} \sum_{i=1}^{I} \log m(x_i). \tag{13.39}$$

When the marginal is not available in closed form, this requires another Monte Carlo 
integration to evaluate $m(x_i)$ for each $i$ as 

$$m(x_i) \approx \frac{1}{J} \sum_{j=1}^{J} f(x_i \mid \theta_j),$$

where the $\theta_j$ are drawn from $\pi(\theta)$. Using these quantities, Lindley information is 
estimated by taking the difference between (13.38) and (13.39). Carlin and Polson 
(1991) use an implementation of this type. Müller and Parmigiani (1993) and Müller 
and Parmigiani (1995) discuss computational issues in the use of Markov-chain 
Monte Carlo methods to evaluate information-theoretic measures such as entropy, 
KL divergence, and Lindley information, and introduce fast algorithms for optimization 
problems that arise in Bayesian design. Bielza et al. (1999) extend those 
algorithms to multistage decision trees. 
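
The nested simulation described in this section can be sketched in a few lines. What follows is a minimal illustration under stated assumptions rather than the implementation used in any of the cited papers: the function name `lindley_information_mc`, its arguments, the simulation sizes, and the normal test model are all hypothetical. The same inner draws $\theta_j$ are reused for every $x_i$, and the inner average is taken on the log scale for numerical stability.

```python
import math
import random

def lindley_information_mc(sample_prior, sample_data, log_lik,
                           n_outer=1000, n_inner=500, seed=1):
    """Nested Monte Carlo estimate of Lindley information, in nats.

    sample_prior(rng) returns a draw of theta, sample_data(rng, theta) a draw
    of x given theta, and log_lik(x, theta) the value of log f(x | theta).
    """
    rng = random.Random(seed)
    thetas = [sample_prior(rng) for _ in range(n_outer)]
    xs = [sample_data(rng, th) for th in thetas]

    # Estimate (13.38): average of log f(x_i | theta_i) over the joint draws.
    term_lik = sum(log_lik(x, th) for x, th in zip(xs, thetas)) / n_outer

    # Fresh prior draws, reused to approximate m(x_i) for every i.
    inner = [sample_prior(rng) for _ in range(n_inner)]

    def log_marginal(x):
        logs = [log_lik(x, th) for th in inner]
        top = max(logs)  # log-sum-exp trick
        return top + math.log(sum(math.exp(v - top) for v in logs) / n_inner)

    # Estimate (13.39): average of log m(x_i).
    term_marg = sum(log_marginal(x) for x in xs) / n_outer
    return term_lik - term_marg

# Assumed test model: theta ~ N(0, 1) and one observation x | theta ~ N(theta, 1).
def sample_prior(rng):
    return rng.gauss(0.0, 1.0)

def sample_data(rng, theta):
    return rng.gauss(theta, 1.0)

def log_lik(x, theta):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - theta) ** 2

estimate = lindley_information_mc(sample_prior, sample_data, log_lik)
exact = 0.5 * math.log(2.0)  # known closed form for this particular model
print(f"Monte Carlo estimate: {estimate:.3f}   closed form: {exact:.3f}")
```

Because the logarithm of an averaged likelihood is biased downward, the estimate of (13.39) is slightly too small and the information estimate slightly too large when the inner sample is small; increasing `n_inner` reduces this effect.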

13.3.4 Optimal design 

An important reason for measuring the expected information provided by a prospective 
experiment is to optimize the experiment with respect to some design variables 
(Chaloner and Verdinelli 1995). Consider a family of experiments $\mathcal{E}_D$, indexed by the 
design variable $D$. The likelihood function for the experiment $\mathcal{E}_D$ is $f_D(x \mid \theta)$, $x \in \mathcal{X}$. 
Let $\pi_D(\theta \mid x)$ denote the posterior distribution resulting from the observation of $x$ and 
let $m_D(x)$ denote the marginal distribution of $x$. If $D$ is the sample size of the experiment, 
we know that $\mathcal{I}(\mathcal{E}^{(n)})$ is monotone, and choosing a sample size requires trading 
off information against other factors like cost. 

In many cases, however, there will be a natural trade-off built into the choice 
of $D$ and $\mathcal{I}(\mathcal{E}_D)$ will have a maximum. Consider this simple example. Let the random 
variable $Y$ have conditional density $f(y \mid \theta)$. We are interested in learning about 
$\theta$, but cannot observe $y$ exactly. We can only observe whether or not it is greater 
than some cutpoint $D$. This problem arises in applications in which we are interested 
in estimating the rate of occurrence of an event, but it is impractical to monitor the 
phenomenon studied continuously. What we observe is then a Bernoulli random variable 
$x$ that has success probability $F(D \mid \theta)$. Choices of $D$ that are very low will give 
mostly zeros and be uninformative, while choices that are very high will give mostly 
ones and also turn out uninformative. The best $D$ is somewhere in the center of the 
distribution of $y$, but where exactly? 

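To solve this problem we first derive general optimality conditions. Let us assume 
that $D$ is one dimensional, continuous, or closely approximated by a continuous 
variable, and that $f_D(x \mid \theta)$ is differentiable with respect to $D$. Lindley information is 

$$\mathcal{I}(\mathcal{E}_D) = \int_{\Theta} \int_{\mathcal{X}} \log\left(\frac{\pi_D(\theta \mid x)}{\pi(\theta)}\right) \pi_D(\theta \mid x)\, m_D(x)\, dx\, d\theta.$$

Define 

$$f'_D(x \mid \theta) = \frac{\partial f_D(x \mid \theta)}{\partial D}, \qquad \pi'_D(\theta \mid x) = \frac{\partial \pi_D(\theta \mid x)}{\partial D}, \qquad m'_D(x) = \frac{\partial m_D(x)}{\partial D}.$$

The derivative of $\mathcal{I}(\mathcal{E}_D)$ with respect to $D$ turns out to be 

$$\frac{d\mathcal{I}(\mathcal{E}_D)}{dD} = \int_{\Theta} \int_{\mathcal{X}} \log\left(\frac{\pi_D(\theta \mid x)}{\pi(\theta)}\right) f'_D(x \mid \theta)\, \pi(\theta)\, dx\, d\theta. \tag{13.40}$$

To see that (13.40) holds, write the integrand of $\mathcal{I}(\mathcal{E}_D)$ as $\log(\pi_D/\pi)\, f_D\, \pi$, using 
$\pi_D m_D = f_D \pi$, and differentiate it to get 

$$\log\left(\frac{\pi_D}{\pi}\right) f'_D\, \pi + \frac{f_D\, \pi'_D}{\pi_D}\, \pi.$$

It follows from differentiating Bayes’ rule that 

$$\pi'_D = \pi\, \frac{f'_D m_D - f_D m'_D}{m_D^2}.$$

Thus 

$$\frac{f_D\, \pi'_D}{\pi_D} = \frac{m_D\, \pi'_D}{\pi} = \frac{f'_D m_D - f_D m'_D}{m_D} = f'_D - f_D \frac{m'_D}{m_D}.$$

It can be proved that the following area conditions hold: 

$$\int_{\Theta} \int_{\mathcal{X}} f_D(x \mid \theta)\, \pi(\theta)\, \frac{m'_D(x)}{m_D(x)}\, dx\, d\theta = \int_{\mathcal{X}} m'_D(x)\, dx = 0, \qquad \int_{\Theta} \int_{\mathcal{X}} f'_D(x \mid \theta)\, \pi(\theta)\, dx\, d\theta = 0.$$

Equation (13.40) follows from this. 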

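With simple manipulations, (13.40) can be rewritten as 

$$\frac{d\mathcal{I}(\mathcal{E}_D)}{dD} = \int_{\Theta} \int_{\mathcal{X}} \log\left(f_D(x \mid \theta)\right) f'_D(x \mid \theta)\, \pi(\theta)\, dx\, d\theta - \int_{\mathcal{X}} \log\left(m_D(x)\right) m'_D(x)\, dx. \tag{13.41}$$

This expression parallels the alternative expression for (13.29), given by Lindley 
(1956): 

$$\mathcal{I}(\mathcal{E}_D) = \int_{\Theta} \int_{\mathcal{X}} \log\left(f_D(x \mid \theta)\right) f_D(x \mid \theta)\, \pi(\theta)\, dx\, d\theta - \int_{\mathcal{X}} \log\left(m_D(x)\right) m_D(x)\, dx. \tag{13.42}$$

Equations (13.40) and (13.41) can be used in calculating the optimal $D$ whenever 
Lindley information enters as a part of the design criterion. For example, if $D$ is 
sample size, and the cost of observation $c$ is fixed, then the real-valued approximation 
to the optimal sample size satisfies 

$$\frac{d\mathcal{I}(\mathcal{E}_D)}{dD} - c = 0. \tag{13.43}$$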


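Returning to the optimal cutoff problem, in which $c = 0$, we have 

$$f_D(x \mid \theta) = \begin{cases} F(D \mid \theta), & \text{if } x = 1 \\ 1 - F(D \mid \theta), & \text{if } x = 0, \end{cases} \tag{13.44}$$

which gives 

$$f'_D(x \mid \theta) = \begin{cases} f(D \mid \theta), & \text{if } x = 1 \\ -f(D \mid \theta), & \text{if } x = 0. \end{cases} \tag{13.45}$$

Similarly, if $F(y)$ is the marginal cdf of $y$, 

$$m_D(x) = \begin{cases} \int_{\Theta} F(D \mid \theta)\, \pi(\theta)\, d\theta = F(D), & \text{if } x = 1 \\ 1 - \int_{\Theta} F(D \mid \theta)\, \pi(\theta)\, d\theta = 1 - F(D), & \text{if } x = 0 \end{cases} \tag{13.46}$$

with 

$$m'_D(x) = \begin{cases} \int_{\Theta} f(D \mid \theta)\, \pi(\theta)\, d\theta = f(D), & \text{if } x = 1 \\ -\int_{\Theta} f(D \mid \theta)\, \pi(\theta)\, d\theta = -f(D), & \text{if } x = 0. \end{cases} \tag{13.47}$$

By inserting (13.45) through (13.47) into (13.41), and setting the derivative equal to 
zero, we obtain 

$$\int_{\Theta} \log\left(\frac{F(D \mid \theta)}{1 - F(D \mid \theta)}\right) f(D \mid \theta)\, \pi(\theta)\, d\theta = \log\left(\frac{F(D)}{1 - F(D)}\right) f(D). \tag{13.48}$$

Now, by dividing both sides by $f(D)$ we obtain that a necessary condition for $D$ to be 
optimal is 

$$\int_{\Theta} \log\left(\frac{F(D \mid \theta)}{1 - F(D \mid \theta)}\right) \pi(\theta \mid y = D)\, d\theta = \log\left(\frac{F(D)}{1 - F(D)}\right); \tag{13.49}$$

in words, we must choose the cutoff so that the expected conditional logit is equal to 
the marginal logit. Expectation is taken with respect to the posterior distribution of $\theta$ 
given that $y$ is at the cutoff point. 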
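
Condition (13.49) can be checked numerically on a small example. The sketch below is an illustration under assumptions that do not come from the text: $y$ is taken to be exponential with rate $\theta$, $\theta$ has a three-point uniform prior, and the function names are hypothetical. It maximizes the Lindley information of the Bernoulli observation over a grid of cutoffs and then evaluates the difference between the two sides of (13.49) at the maximizer.

```python
import math

# Assumed example: y | theta is exponential with rate theta, and theta takes
# one of three values with equal prior probability.
THETAS = (0.5, 1.0, 2.0)
PRIOR = (1 / 3, 1 / 3, 1 / 3)

def cdf(d, theta):
    return 1.0 - math.exp(-theta * d)

def pdf(d, theta):
    return theta * math.exp(-theta * d)

def logit(p):
    return math.log(p / (1.0 - p))

def lindley_information(d):
    """Information about theta in x, where x = 1 with probability F(d | theta)."""
    m1 = sum(p * cdf(d, th) for p, th in zip(PRIOR, THETAS))  # marginal P(x = 1)
    info = 0.0
    for p, th in zip(PRIOR, THETAS):
        f1 = cdf(d, th)
        info += p * (f1 * math.log(f1 / m1)
                     + (1.0 - f1) * math.log((1.0 - f1) / (1.0 - m1)))
    return info

def logit_gap(d):
    """Left side minus right side of (13.49) at cutoff d."""
    weights = [p * pdf(d, th) for p, th in zip(PRIOR, THETAS)]
    total = sum(weights)  # marginal density f(d)
    expected_logit = sum(w / total * logit(cdf(d, th))
                         for w, th in zip(weights, THETAS))
    marginal_cdf = sum(p * cdf(d, th) for p, th in zip(PRIOR, THETAS))
    return expected_logit - logit(marginal_cdf)

grid = [0.05 + 0.001 * i for i in range(7951)]  # cutoffs from 0.05 to 8.0
d_opt = max(grid, key=lindley_information)
print(f"best cutoff on the grid: {d_opt:.3f}")
print(f"gap in (13.49) at that cutoff: {logit_gap(d_opt):.5f}")
```

The gap should be close to zero at the grid maximizer, positive for smaller cutoffs (where information is still increasing in $D$) and negative for larger ones.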


13.4 Minimax and the value of information 

Theorem 13.2 requires that actions should be chosen according to the expected utility 
principle. What happens if an agent is minimax and a is chosen accordingly? In that 
case, the minimax principle can lead to deciding without experimenting, in instances 
where it is very hard to accept the conclusion that the new information is worthless. 
The point was first raised by Savage, as part of his argument for the superiority of 
the regret (or loss) form over the negative utility form of the minimax principle: 

Reading between the lines, it appears that Wald means to work with loss 
and not negative income. For example, on p.124 he says that if a certain 
experiment is to be done, and the only thing to decide is what to do after 
seeing its outcome, then the cost of the experiment (which may well 
depend on the outcome) is irrelevant to the decision; this statement is 
right for loss but wrong for negative income. (Savage 1951, p. 65) 

While Savage considered this objection not to be relevant for the regret case, one 
can show that the same criticism applies to both forms. One can construct examples 
where the minimax regret principle attributes the same value $V$ to an ancillary statistic 
and to a consistent estimator of the state of nature for arbitrary $n$. In other instances 
(Hodges and Lehmann 1950) the value is highest for observing an inconsistent 
estimator of the unknown quantity, even if a consistent estimator is available. 

A simple example will illustrate the difficulty with minimax (Parmigiani 1992). 
This is constructed so that minimax is well behaved in the negative utility version of 
the problem, and not so well behaved in the regret (or loss) version. Examples can 
be constructed where the reverse is true. A box contains two marbles in one of three 
possible configurations: both marbles are red, the first is blue and the second is red, or 
both are blue. You can choose between two actions, whose consequences are summarized 
in the table below, whose entries are negative payoffs (a 4 means you pay \$4): 

          RR    BR    BB
$a_1$      2     0     4
$a_2$      4     4     0

A calculation left as an exercise would show that the minimax action for this 
problem is a mixed strategy that puts weight 2/3 on $a_1$ and 1/3 on $a_2$, and has risk 
2 + 2/3 in states RR and BB and 1 + 1/3 in state BR. 

Suppose now that you have the option to see the first marble at a cost of \$0.1. 
To determine whether the observation is worthwhile, you look at its potential use. It 
would not be very smart to choose act $a_2$ after seeing that the first marble is red, so 
the undominated decision functions differ only in what they recommend to do after 
having seen that the first marble is blue. Call $a_1(B)$ the function choosing $a_1$, and 
$a_2(B)$ that choosing $a_2$. Then the available unmixed acts and their consequences are 
summarized in the table 

             RR     BR     BB
$a_1$         2      0      4
$a_2$         4      4      0
$a_1(B)$    2.1    0.1    4.1
$a_2(B)$    2.1    4.1    0.1

As you can see, $a_1(B)$ is not an admissible decision rule, because it is dominated by 
$a_1$, but $a_2(B)$ is admissible. The equalizer strategy yields an even mixture of $a_1$ and 
$a_2(B)$ which has average loss of \$2.05 in every state. So with positive probability, 
minimax applied to the absolute losses prescribes making the observation. 

Let us now consider the regret form of the same loss table. We get that by shifting 
each column so that the minimum is 0, which gives 

          RR    BR    BB
$a_1$      0     0     4
$a_2$      2     4     0

The minimax strategy for this table is an even mixture of $a_1$ and $a_2$ and has risk 1 
in state RR and 2 in the others. If you have the option of experimenting, the unmixed 
table is 

             RR     BR     BB
$a_1$         0      0      4
$a_2$         2      4      0
$a_1(B)$    0.1    0.1    4.1
$a_2(B)$    0.1    4.1    0.1

In minimax decision making, you act as though an intelligent opponent decided 
which state occurs. In this case, state RR does not play any role in determining the 
minimax procedure, being dominated by state BR from the point of view of your 
opponent. You have no incentive to consider strategies $a_1(B)$ and $a_2(B)$: any mixture 
of one of these with the two acts that do not include observation will increase the 
average loss over the value of 2 in either column. 

So in the end, if you follow the minimax regret principle you never perform 
the experiment. The same conclusion is reached no matter how little the experiment 
costs, provided that the cost is positive. The assumption that you face an opponent 
whose gains are your losses makes you disregard the fact that, in some cases, observing 
the first marble gives you the opportunity to choose an act with null loss, and 
only makes you worry that the marble may be blue, in which case no further cue 
is obtained. This concern can lead you to “ignore extensive evidence,” to put it in 
Savage’s terms. What makes the negative utility version work the opposite way is 
the “incentive” given to the opponent to choose the state where the experiment is 
valuable to you. 
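
The four minimax mixtures in the marble example can be reproduced with a brute-force search. The sketch below is a crude illustration, not a general solver: it restricts mixture weights to multiples of 1/30, a grid chosen here because it contains the halves and thirds that appear in this example, and the function names `minimax_mixture` and `regret` are hypothetical. A linear programming routine would be the usual tool for larger tables.

```python
from fractions import Fraction
from itertools import product

def minimax_mixture(table, steps=30):
    """Search mixtures of the rows of `table` for the smallest maximum expected loss.

    Weights are restricted to multiples of 1/steps, so the answer is exact only
    when the minimax weights lie on that grid.
    """
    n_acts, n_states = len(table), len(table[0])
    best = None
    for head in product(range(steps + 1), repeat=n_acts - 1):
        if sum(head) > steps:
            continue
        weights = [Fraction(c, steps) for c in head + (steps - sum(head),)]
        worst = max(sum(w * row[s] for w, row in zip(weights, table))
                    for s in range(n_states))
        if best is None or worst < best[0]:
            best = (worst, weights)
    return best

def regret(table):
    """Shift each column (state) so that its smallest entry is zero."""
    col_min = [min(col) for col in zip(*table)]
    return [[v - m for v, m in zip(row, col_min)] for row in table]

cost = Fraction(1, 10)
# Rows are acts, columns are the states RR, BR, BB.
terminal = [[Fraction(v) for v in row] for row in ([2, 0, 4], [4, 4, 0])]  # a1, a2
observe = [[v + cost for v in row] for row in ([2, 0, 4], [2, 4, 0])]      # a1(B), a2(B)
with_experiment = terminal + observe

tables = {
    "loss, no experiment": terminal,
    "loss, with experiment": with_experiment,
    "regret, no experiment": regret(terminal),
    "regret, with experiment": regret(with_experiment),
}
results = {name: minimax_mixture(table) for name, table in tables.items()}
for name, (value, weights) in results.items():
    print(f"{name}: value {value}, weights {[str(w) for w in weights]}")
```

In the loss form the search puts weight on $a_2(B)$, so the observation is used; in the regret form the weights on $a_1(B)$ and $a_2(B)$ are zero, in line with the discussion above.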

Chernoff (1954) noticed that an agent minimizing the maximum regret could 
in some cases choose action $a_1$ over action $a_2$ when these are the only available 
options, but choose $a_2$ when some other option $a_3$ is made available. This annoying 
feature originates from the fact that turning a payoff table into a regret table requires 
subtracting constants from each column, and these constants can be different when new 
rows appear. In our example, the added rows arise from the option to experiment: 
the losses of a rule based on an experiment are linear combinations of entries already 
present in the same column as payoffs for the terminal actions, and cannot exceed the 
maximum loss when no experimentation is allowed. This is a different mechanism 
by which minimax decision making can “ignore extensive evidence” compared to the 
earlier example from Chernoff, because the constants for the standardization to regret 
are the same after one adds the two rows corresponding to experimental options. 


13.5 Exercises 

Problem 13.1 Prove equation (13.7). 

Problem 13.2 (Savage 1954) In the example of Section 13.2.1 suppose the decision 
maker could directly observe the quality of the grapes. Show that the decision 
maker’s best action would then yield 2 utilities, and show that it could not possibly 
lead the decision maker to buy 2 pounds of grapes. 

Problem 13.3 Using simulation, approximate the distribution of $V_x(\mathcal{E})$ in the example 
in Section 13.1.4. Suppose each observation costs \$230. Compute the marginal 
cost-effectiveness ratio for performing the experiment in the example (versus the 
status quo of no experimentation). Use the negative of the loss as the measure of 
effectiveness. 

Problem 13.4 Derive the four minimax strategies advertised in Section 13.4. 

Problem 13.5 Consider an experiment consisting of a single Bernoulli observation, 
from a population with success probability $\theta$. Consider a simple versus simple 
hypothesis testing situation in which $\mathcal{A} = \{a_0, a_1\}$, $\Theta = \{0.50, 0.75\}$, and with utilities 
as shown in Table 13.7. Compute $V(\mathcal{E}) = E_x[V(\pi_x)] - V(\pi)$. 

Problem 13.6 Show that $\mathcal{I}$ is invariant to one-to-one transformations of the 
parameter $\theta$. 

Problem 13.7 Let $f(x \mid \theta) = p f_1(x \mid \theta) + (1 - p) f_2(x \mid \theta)$. Let $\mathcal{E}$, $\mathcal{E}_1$, $\mathcal{E}_2$ be the experiments 
of observing a sample from $f(x \mid \theta)$, $f_1(x \mid \theta)$ and $f_2(x \mid \theta)$ respectively. Show that 

$$\mathcal{I}(\mathcal{E}) \le p\, \mathcal{I}(\mathcal{E}_1) + (1 - p)\, \mathcal{I}(\mathcal{E}_2). \tag{13.50}$$

Table 13.7 Utilities for Problem 13.5. 

             States of nature
Actions      0.50     0.75
$a_0$           0       -1
$a_1$          -1        0

Problem 13.8 Consider a population of individuals cross-classified according to 
binary variables $x_1$ and $x_2$. Suppose that the overall proportions of individuals with 
characteristics $x_1$ and $x_2$ are known to be $p_1$ and $p_2$ respectively. Say 
$p_1 < p_2 < 1 - p_2 < 1 - p_1$. However, the proportion $\omega$ of individuals with both characteristics is 
unknown. We are considering four experiments: 

1. Random sample of individuals with $x_1 = 1$ 

2. Random sample of individuals with $x_1 = 0$ 

3. Random sample of individuals with $x_2 = 1$ 

4. Random sample of individuals with $x_2 = 0$. 

Using the result stated in Problem 13.7 show that experiment 1 has the highest 
information on $\omega$ irrespective of the prior. 

Hint: Work with the information on $\theta = \omega/p_1$, and then translate the result in 
terms of $\omega$. 

Problem 13.9 Prove Theorem 13.5. 

Hint: Use the factorization theorem about sufficient statistics. 

Problem 13.10 Prove Theorem 13.6. Use its result to prove Theorem 13.7. 

Problem 13.11 Consider an experiment $\mathcal{E}$ that consists of observing $n$ conditionally 
independent random variables $x_1, \ldots, x_n$, with $x_i \sim N(\theta, \sigma^2)$, with $\sigma$ known. Suppose 
also that a priori $\theta \sim N(\mu_0, \tau_0^2)$. Show that 

$$\mathcal{I}(\mathcal{E}) = \frac{1}{2} \log\left(1 + n \frac{\tau_0^2}{\sigma^2}\right).$$

You can use facts about conjugate priors from Bernardo and Smith (2000) or Berger 
(1985). However, please rederive $\mathcal{I}$. 

Problem 13.12 Prove Theorem 13.9. 

Hint: As in the proof of Theorem 13.1, consider any pair of prior distributions $\pi_1$ 
and $\pi_2$ of $\theta$ and $0 < \alpha < 1$. Then, calculate Lindley information for any experiment 
$\mathcal{E}$ with prior distribution $\alpha \pi_1 + (1 - \alpha) \pi_2$ for the unknown $\theta$. 


14 Sample size 


In this chapter we discuss one of the most common decisions in statistical practice: 
the choice of the sample size for a study. We initially focus on the case in which the 
decision maker selects a fixed sample size $n$, collects a sample $x_1, \ldots, x_n$, and makes 
a terminal decision based on the observed sample. Decision-theoretic approaches to 
sample size determination formally model the view that the value of an experiment 
depends on the use that is planned for the results, and in particular on the decision 
that the results must help address. This approach, outlined in greater generality in 
Chapter 13, provides conceptual and practical guidance for determining an optimal 
sample size in a very broad variety of experimental situations. Here we begin with an 
overview of decision-theoretic concepts in sample size. We next move to a general 
simulation-based algorithm to solve optimal sample size problems in complex practical 
applications. Finally, we illustrate both theoretical and computational concepts 
with examples. 

The main reading for this chapter is Raiffa and Schlaifer (1961, chapter 5), who 
are generally credited for providing the first complete formalization of Bayesian 
decision-theoretic approaches to sample size determination. The general ideas 
are implicit in earlier decision-theoretic approaches. For example Blackwell and 
Girshick (1954, page 170) briefly defined the framework for Bayesian optimal fixed 
sample size determination. 

Featured book (Chapter 5): 

Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory, Harvard 
University Press, Boston, MA. 

Textbook references on Bayesian optimal sample sizes include DeGroot (1970) 
and Berger (1985). For a review of Bayesian approaches to sample size determination 
(including less formal decision-theoretic approaches) see Pham-Gia and 
Turkkan (1992), Adcock (1997), and Lindley (1997); for practical applications see, 
for instance, Yao et al. (1996) and Tan and Smith (1998). 

Decision Theory: Principles and Approaches G. Parmigiani, L. Y. T. Inoue 
© 2009 John Wiley & Sons, Ltd 


14.1 Decision-theoretic approaches to sample size 

14.1.1 Sample size and power 

From a historical perspective, a useful starting point for understanding 
decision-theoretic sample size determination is the Neyman-Pearson theory of testing a 
simple hypothesis against a simple alternative (Neyman and Pearson 1933). A 
standard approach in this context is to seek the smallest sample size that is sufficient 
to achieve a desired power at a specified significance level. From the standpoint of 
our discussion, there are three key features in this procedure: 

1. After the data are observed, an optimal decision rule—in this case the use of 
a uniformly most powerful test—is used. 

2. The sample size is chosen to be the smallest that achieves a desired 
performance level. 

3. The same optimality criteria—in this case significance and power—are used 
for selecting both a terminal decision rule and the sample size. 

As in the value of information analysis of Chapter 13, the sample size depends 
on the benefit expected from the data once they are put to use, and the quantification 
of this benefit is user specified. This structure foreshadows the decision-theoretic 
approach to sample size that was to be developed in later years in the influential 
work by Wald (1947b), described below. 

Generally, in choosing the sample size for a study, one tries to weigh the 
trade-off between carrying out a small study and improving the final decision with 
respect to some criterion. This can be achieved in two ways: either by finding 
the smallest sample size that guarantees a desired level of the criterion, or by 
explicitly modeling the trade-off between the utilities and costs of experimentation. 
Both of these approaches are developed in more detail in the remainder of this 
section. 

14.1.2 Sample size as a decision problem 

In previous chapters we dealt with problems where a decision maker has a utility 
$u(a(\theta))$ for decision $a$ when the state of nature is $\theta$. In the statistical 
decision-theoretic material we also cast the same problem in terms of a loss $L(\theta, a)$ for making 
decision $a$ when the state of nature is $\theta$, and the decision is based on a sample 
$x^n = (x_1, x_2, \ldots, x_n)$ of size $n$ drawn from $f(x \mid \theta)$. We now extend this to also choosing 
optimally the sample size $n$ of the experiment. We first specify a more general 
utility function $u(\theta, a, n)$ or, equivalently, a loss function $L(\theta, a, n)$ for observing a 
sample of size $n$ and making decision $a$ when the state of nature is $\theta$. The sample 
size $n$ is really an action here, so the concepts of utility and loss are not altered in 
substance. We commonly use the form 

$$u(\theta, a, n) = u(a(\theta)) - C(n) \tag{14.1}$$

$$L(\theta, a, n) = L(\theta, a) + C(n). \tag{14.2}$$

The function $L(\theta, a)$ in (14.2) refers, as in previous chapters, to the choice of $a$ after 
the sample is observed and is referred to as the terminal loss function. The function 
$C(n)$ represents the cost of collecting a sample of size $n$. This formulation can be 
traced to Wald (1947a) and Wald (1949). 

The practical implications of accounting for the costs of experimentation in 
(14.2) are explained in an early paper by Grundy et al.: 

One of the results of scientific research is the development of new processes 
for use in technology and agriculture. Usually these new processes 
will have been worked out on a small scale in the laboratory, and experience 
shows that there is a considerable risk in extrapolating laboratory 
results to factory or farm scale where conditions are less thoroughly 
controlled. It will usually pay, therefore, to carry out a programme of 
full-scale experiments before allowing a new process to replace an old 
one that is known to give reasonably satisfactory results. However, full-scale 
experimentation is expensive and its costs have to be set against any 
increase in output which may ensue if the new process fulfills expectations. 
The question then arises of how large an experimental programme 
can be considered economically justifiable. If the programme is extensive, 
we are not likely to reject a new process that would in fact have 
been profitable to install, but the cost of the experiments may eat up 
most of the profits that result from a correct decision; if we economize 
on the experiments, we may fail to adopt worthwhile improvements in 
technique, the results of which have been masked by experimental errors. 
(Grundy et al. 1956, p. 32) 

Similar considerations apply to a broad range of other scientific contexts. Raiffa 
and Schlaifer further comment on the assumption of additivity of the terminal utility 
$u(a(\theta))$ and experimental costs $C(n)$: 

this assumption of additivity by no means restrict us to problems in 
which all consequences are monetary.... In general, sampling and terminal 
utilities will be additive whenever consequences can be measured 
or scaled in terms of any common numeraire the utility of which is linear 
over a suitably wide range; and we point out that number of patients 
cured or number of hours spent on research may well serve as such a 
numeraire in problems where money plays no role at all. (Raiffa and 
Schlaifer 1961, p. xiii) 

14.1.3 Bayes and minimax optimal sample size 

A Bayesian optimal sample size can be determined for any problem by specifying 
four components: (a) a likelihood function for the experiment; (b) a prior distribution 
on the unknown parameters; (c) a loss (or utility) function for the terminal 
decision problem; and (d) a cost of experimentation. Advantages include explicit 
consideration of costs and benefits, formal quantification of parameter uncertainty, 
and consistent treatment of this uncertainty in the design and inference stages. 

Specifically, say $\pi$ represents a priori knowledge about the unknown parameters, 
and $\delta^*_n$ is the Bayes rule with respect to that prior (see Raiffa and Schlaifer 1961, 
or DeGroot 1970). As a quick reminder, the Bayes risk is defined, as in equation 
(7.9), by 

$$r(\pi, \delta_n) = \int_{\Theta} R(\theta, \delta_n)\, \pi(\theta)\, d\theta$$

and the Bayes rule $\delta^*_n$ minimizes the Bayes risk $r(\pi, \delta_n)$ among all the possible $\delta_n$ 
when the sample size is $n$. Thus the optimal Bayesian sample size $n^*$ is the one that 
minimizes 

$$r(\pi, n) = r(\pi, \delta^*_n) + C(n). \tag{14.3}$$

By expanding the definition of the risk function and reversing the order of 
integration, we see that 

$$r(\pi, n) = \int_{\mathcal{X}} \int_{\Theta} L(\theta, \delta^*_n(x^n), n)\, \pi(\theta \mid x^n)\, d\theta\; m(x^n)\, dx^n. \tag{14.4}$$

Similarly, the Bayesian formulation can be reexpressed in terms of utility as the 
maximization of the function 

$$U_\pi(n) = \int_{\mathcal{X}} \int_{\Theta} u(\theta, \delta^*_n(x^n), n)\, \pi(\theta \mid x^n)\, d\theta\; m(x^n)\, dx^n. \tag{14.5}$$

Both formulations correspond to the solution of a two-stage decision tree similar 
to those of Sections 12.3.1 and 13.1.3, in which the first stage is the selection of the 
sample size, and the second stage is the solution of the statistical decision problem. In 
this sense Bayesian sample size determination is an example of preposterior analysis. 
Experimental data are conditioned upon in the inference stage and averaged over in 
the design stage. 

By contrast, in a minimax analysis, we would analyze the data using the minimax 
decision function $\delta^M_n$ chosen so that 

$$\inf_{\delta_n \in \mathcal{D}} \sup_{\theta} R(\theta, \delta_n) = \sup_{\theta} R(\theta, \delta^M_n).$$

Thus the best that one can expect to achieve after a sample of size $n$ is observed is 
$\sup_{\theta \in \Theta} R(\theta, \delta^M_n)$, and the minimax sample size $n^M$ minimizes 

$$\sup_{\theta} R(\theta, \delta^M_n) + C(n). \tag{14.6}$$

Exploiting the formal similarity between the Bayesian and minimax schemes, 
Blackwell and Girshick (1954, page 170) suggest a general formalism for the optimal 
sample size problem as the minimization of a function of the form 

$$\rho_n(\pi, \delta_n) + C(n), \tag{14.7}$$

where $\delta_n$ is a terminal decision rule based on $n$ observations. This formulation encompasses 
both frequentist and Bayesian decision-theoretic approaches. In the Bayesian 
case $\delta_n$ is replaced by $\delta^*_n$. For the minimax case, $\delta_n$ is the minimax rule, and $\pi$ the 
prior that makes the minimax rule a Bayes rule (the so-called least favorable prior), 
if it exists. This formulation allows us in principle to specify criteria that do not make 
use of the optimal decision rule in evaluating $\rho$, a feature that may be useful when 
seeking approximations. 


14.1.4 A minimax paradox 

With some notable exceptions (Chernoff and Moses 1959), explicit 
decision-theoretic approaches are rare among frequentist sample size work, but more common 
within Bayesian work (Lindley 1997, Tan and Smith 1998, Müller 1999). Within the 
frequentist framework it is possible to devise sensible ad hoc optimal sample size 
criteria for specific families of problems, but it is more difficult to formulate a general 
rationality principle that can reliably handle a broad set of design situations. To 
illustrate this point consider this example, highlighting a limitation of the minimax 
approach. 

Suppose we are interested in testing $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$ based on 
observations sampled from $N(\theta, \sigma^2)$, the normal distribution with unknown mean $\theta$ and 
known variance $\sigma^2$. We can focus on the class of tests that reject the null hypothesis 
if the sample mean is larger than a cutoff $d$, $\delta_d(\bar{x}_n) = I_{\{\bar{x}_n > d\}}$, where $I_E$ is the indicator 
function of the event $E$. The sample mean is a sufficient statistic and this class is 
admissible (see Berger 1985 for details). 

Let $F(\cdot \mid \theta)$ be the cumulative probability function of $\bar{x}_n$ given $\theta$. Under the 0-1-$k$ 
loss function of Table 14.1, the associated risk is given by 

$$R(\theta, \delta_d) = \begin{cases} k\,(1 - F(d \mid \theta)) & \text{if } \theta \le \theta_0 \\ F(d \mid \theta) & \text{if } \theta > \theta_0. \end{cases}$$

Table 14.1 The 0-1-$k$ loss function. No loss is incurred if $H_0$ 
holds and action $a_0$ is chosen, or if $H_1$ holds and action $a_1$ is 
chosen. If $H_1$ holds and decision $a_0$ is made, the loss is 1; if $H_0$ 
holds and decision $a_1$ is made, the loss is $k$. 

                         $H_0$    $H_1$
$a_0$: accept $H_0$        0        1
$a_1$: reject $H_0$        $k$      0

Let $V(\theta) = E_{\bar{x}_n \mid \theta}[\delta_d(\bar{x}_n)] = 1 - F(d \mid \theta)$ denote the power function (Lehmann 1983). 
In the frequentist approach to hypothesis testing, the power function evaluated at $\theta_0$ 
gives the type I error probability. When evaluated at any given $\theta = \theta_1$ chosen within 
the alternative hypothesis, the power function gives the power of the test against the 
alternative $\theta_1$. 

Since $V$ is increasing in $\theta$, it follows that 

$$\sup_{\theta} R(\theta, \delta_d) = \max\{k V(\theta_0),\, 1 - V(\theta_0)\},$$

which is minimized by choosing $d$ such that $k V(\theta_0) = 1 - V(\theta_0)$, that is $V(\theta_0) = 
1/(1 + k)$. When one specifies a type I error probability of $\alpha$, the implicit $k$ can be 
determined by solving $1/(1 + k) = \alpha$, which implies $k = (1 - \alpha)/\alpha$. For instance, 
$\alpha = 0.05$ corresponds to $k = 19$. 

Solving the equation $V(\theta_0) = 1/(1 + k)$ with respect to the cutoff point $d$, we 
find a familiar solution: 

$$d^M = \theta_0 + z_{k/(1+k)}\, \frac{\sigma}{\sqrt{n}},$$

where $z_p$ denotes the $p$ quantile of the standard normal distribution. 
Since $\inf_{\delta \in \mathcal{D}} \sup_{\theta} R(\theta, \delta) = \sup_{\theta} R(\theta, \delta_{d^M})$, $\delta^M = \delta_{d^M}$ is the minimax rule. 

It follows that $\sup_{\theta} R(\theta, \delta_{d^M}) + C(n) = k/(k + 1) + C(n)$ is an increasing function 
in $n$, as long as $C$ is itself increasing, which is always to be expected. Thus, 
the optimal sample size is zero! The reason is that, with the 0-1-$k$ loss function, the 
maximum risk of the minimax rule, $k/(k + 1)$, does not depend on $n$. The minimax 
strategy is rational if an intelligent opponent 
whose gains are our losses is choosing $\theta$. In this problem such an opponent will 
always choose $\theta = \theta_0$ and make it as hard as possible for us to distinguish between 
the two hypotheses. This is another example of the “ultra-pessimism” discussed in 
Section 13.4. 
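
The paradox is easy to see numerically. The sketch below is an illustration under assumed values ($\theta_0 = 0$, $\sigma = 1$, $k = 19$, and a linear cost with $c = 0.001$); the function names are hypothetical. For several sample sizes it computes the cutoff that solves $V(\theta_0) = 1/(1 + k)$ and the resulting maximum risk.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def minimax_cutoff(n, k, theta0=0.0, sigma=1.0):
    """Cutoff d for which the rejection probability at theta0 equals 1 / (1 + k)."""
    return theta0 + sigma / n ** 0.5 * STD_NORMAL.inv_cdf(k / (1 + k))

def max_risk(d, n, k, theta0=0.0, sigma=1.0):
    """Largest risk over theta: max{k V(theta0), 1 - V(theta0)}."""
    power_at_null = 1.0 - STD_NORMAL.cdf((d - theta0) * n ** 0.5 / sigma)
    return max(k * power_at_null, 1.0 - power_at_null)

k, c = 19, 0.001
for n in (1, 10, 100, 1000):
    d = minimax_cutoff(n, k)
    worst = max_risk(d, n, k)
    print(f"n = {n:4d}  cutoff = {d:.4f}  max risk = {worst:.4f}  "
          f"max risk + cost = {worst + c * n:.4f}")
```

The cutoff shrinks toward $\theta_0$ as $n$ grows, but the maximum risk stays at $k/(k + 1)$, so only the cost term changes with $n$.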

Let us now study the problem from a Bayesian viewpoint. Using the 0-1-$k$ loss 
function, the posterior expected losses of accepting and rejecting the null hypothesis 
are $\pi(H_1 \mid x^n)$ and $k\,\pi(H_0 \mid x^n)$, respectively. The Bayes decision minimizes the posterior 
expected loss. Thus, the Bayes rule rejects the null hypothesis whenever $\pi(H_1 \mid x^n) > 
k/(1 + k)$ (see Section 7.5.1). Suppose we have a $N(\mu_0, \tau_0^2)$ prior on $\theta$. Solving the 
previous inequality in $\bar{x}_n$, we find that the Bayes decision rule is $\delta^*_n(x^n) = I_{\{\bar{x}_n > d^B\}}$ with 
the cutoff given by 



With a little bit of work, we can show that the Bayes risk $r$ is given by 




(14.8) 

Figure 14.1 Bayes risk $r(\pi, n)$ as a function of the sample size $n$ with the 0-1-$k$ loss 
function. In the left panel the cost per observation is $c = 0.001$, in the right panel 
the cost per observation is $c = 0.0001$. While the optimal sample size is unique in 
both cases, in the case on the right a broad range of choices of sample size achieve 
a similar risk. 


The above expression can easily be evaluated numerically and optimized with respect 
to $n$, as we did in Figure 14.1. While it is possible that the optimal $n$ may be zero if 
costs are high enough, a range of results can emerge depending on the specific combination 
of cost, loss, and hyperparameters. For example, if $C(n) = cn$, with $c = 0.001$, 
and if $k = 19$, $\sigma = 1$, and $\tau_0 = 1$, then the optimal sample size is $n^B = 54$. When the 
cost per observation $c$ is 0.0001, the optimal sample size is $n^B = 255$. 
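
A numerical evaluation of this kind can be sketched as follows. This is not the computation behind Figure 14.1 but a reconstruction under assumptions: the text does not state $\mu_0$ or $\theta_0$, so both are set to 0 here, and the cutoff is obtained by solving $\pi(H_1 \mid x^n) > k/(1 + k)$ with the usual conjugate normal posterior, which gives $\theta_0 + \sigma^2(\theta_0 - \mu_0)/(n\tau_0^2) + z_{k/(1+k)}\,\sigma\sqrt{\sigma^2 + n\tau_0^2}/(n\tau_0)$. The Bayes risk is computed as the marginal expectation of the posterior expected loss of the chosen action, using Simpson's rule on each side of the cutoff. All function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()

def bayes_cutoff(n, k, theta0, mu0, tau0, sigma):
    """Reject H0 when the sample mean exceeds this value (derived, see lead-in)."""
    z = STD.inv_cdf(k / (1 + k))
    spread = sigma**2 + n * tau0**2
    return (theta0 + sigma**2 * (theta0 - mu0) / (n * tau0**2)
            + z * sigma * spread**0.5 / (n * tau0))

def simpson(f, a, b, m=1000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    total += 4 * sum(f(a + i * h) for i in range(1, m, 2))
    total += 2 * sum(f(a + i * h) for i in range(2, m, 2))
    return total * h / 3

def bayes_risk(n, k, theta0=0.0, mu0=0.0, tau0=1.0, sigma=1.0):
    """Prior expected 0-1-k loss of the Bayes test based on n observations."""
    prior_h1 = 1.0 - NormalDist(mu0, tau0).cdf(theta0)
    if n == 0:  # no data: take the better of the two terminal actions
        return min(prior_h1, k * (1.0 - prior_h1))
    d = bayes_cutoff(n, k, theta0, mu0, tau0, sigma)
    marginal = NormalDist(mu0, (tau0**2 + sigma**2 / n) ** 0.5)  # law of the sample mean
    post_sd = (1.0 / tau0**2 + n / sigma**2) ** -0.5

    def post_h1(xbar):
        post_mean = post_sd**2 * (mu0 / tau0**2 + n * xbar / sigma**2)
        return 1.0 - STD.cdf((theta0 - post_mean) / post_sd)

    lo, hi = mu0 - 8 * marginal.stdev, mu0 + 8 * marginal.stdev
    # Below the cutoff we accept H0 and lose 1 if H1 is true.
    accept = simpson(lambda x: post_h1(x) * marginal.pdf(x), lo, d)
    # Above the cutoff we reject H0 and lose k if H0 is true.
    reject = simpson(lambda x: k * (1.0 - post_h1(x)) * marginal.pdf(x), d, hi)
    return accept + reject

costs = (0.001, 0.0001)
risks = [bayes_risk(n, k=19) for n in range(401)]
best = {c: min(range(len(risks)), key=lambda n: risks[n] + c * n) for c in costs}
for c in costs:
    print(f"c = {c}: optimal n = {best[c]}, total risk = {risks[best[c]] + c * best[c]:.4f}")
```

If the assumed $\mu_0 = \theta_0$ matches the setting used in the text, the two optima should fall close to the sample sizes quoted above; with the smaller cost the objective is very flat near its minimum, so nearby values of $n$ give almost the same total risk.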

14.1.5 Goal sampling 

A different Bayesian view of sample size determination can be traced back to 
Lindley’s seminal paper on measuring the value of information, discussed in Chapter 13. 
There, he suggests that a: 

consequence of the view that one purpose of statistical experimentation 
is to gain information will be that the statistician will stop experimentation 
when he has enough information. (Lindley 1956, p. 1001) 

A broad range of frequentist and Bayesian methods for sample size determination 
can be described as choosing the smallest sample that is sufficient to achieve, in 
expectation, some set goals of experimentation. An example of the former is seeking 
the smallest sample size that is sufficient to achieve a desired power at a specified 
significance level. An example of the latter is seeking the smallest sample size necessary 
to obtain, in expectation, a fixed width posterior probability interval for a parameter 
of interest. We refer to this as goal sampling. It can be implemented sequentially, by 
collecting data until a goal is met, or nonsequentially, by planning a sample that is 
sufficient to reach the goal in expectation. 

Choosing the smallest sample size to achieve a set of goals can be cast in 
decision-theoretic terms, from both the Bayesian and frequentist viewpoints. This 
approach can be described as a constrained version of that described in Section 
14.1.3, where one sets a desired value of $\rho$ and minimizes (14.7) subject to that 
constraint. Because both cost and performance are generally monotone in $n$, an 
equivalent formulation is to choose the smallest $n$ necessary to achieve a desired $\rho$. 
This formulation is especially attractive when cost of experimentation is hard to 
quantify or when the units of $\rho$ are difficult to compare to the monetary unit $c$. 

In hypothesis testing, we can also use this idea to draw correspondences between 
Bayesian and classical approaches to sample size determination based on controlling 
probabilities of classification error. We illustrate this approach in the case of testing a 
simple hypothesis against a simple alternative with normal data. Additional examples 
are in Inoue et al. (2005). 

Suppose that $x_1, \ldots, x_n$ is a sample from a $N(\theta, 1)$ distribution. It is desired to test $H_0: \theta = \theta_0$ versus $H_1: \theta = \theta_1$, where $\theta_1 > \theta_0$, with the 0-1 loss function. This is a special case of the 0-1-$k$ loss of Table 14.1 with $k = 1$ and assigns a loss of 0 to both correct decisions and a loss of 1 to both incorrect decisions. Assume a priori that $\pi(H_0) = 1 - \pi(H_1) = \pi = 1/2$. The Bayes rule is to accept $H_0$ whenever $\pi(H_0 \mid x^n) > 1/2$ or, equivalently, $\bar{x}_n < (\theta_0 + \theta_1)/2$. Let $F(\cdot \mid \theta)$ denote the cumulative probability function of $\bar{x}_n$ given $\theta$. Under the 0-1 loss function, the Bayes risk is

$$
r(\pi, \delta^*_n) = \frac{1}{2}\left[1 - F\left(\frac{\theta_0 + \theta_1}{2} \,\Big|\, \theta_0\right)\right] + \frac{1}{2}\, F\left(\frac{\theta_0 + \theta_1}{2} \,\Big|\, \theta_1\right) = 1 - \Phi\left(\frac{\sqrt{n}\,(\theta_1 - \theta_0)}{2}\right), \tag{14.9}
$$

where $\Phi(z)$ is the cdf of a standard normal density. The Bayes risk in (14.9) is
monotone and approaches 0 as the sample size approaches infinity.

The goal sampling approach is to choose the minimum sample size needed to achieve a desired value of the expected Bayes risk, say $r_0$. In symbols, we find $n_{r_0}$ that solves $r(\pi, \delta^*_n) = r_0$. Applying this to equation (14.9) we obtain

$$
n_{r_0} = \left(\frac{2\, z_{1-r_0}}{\theta_1 - \theta_0}\right)^2, \tag{14.10}
$$

where $z_{1-r_0}$ is such that $\Phi(z_{1-r_0}) = 1 - r_0$.

A standard frequentist approach to sample size determination corresponds to 
determining the sample size under specified constraints in terms of probabilities of 
type I and type II error (see, for example, Desu and Raghavarao 1990, Shuster 1993, 
Lachin 1998). In our example, the frequentist optimal sample size is given by 


$$
n_F = \left(\frac{z_\alpha + z_\beta}{\theta_1 - \theta_0}\right)^2, \tag{14.11}
$$

where $\alpha$ and $\beta$ correspond to the error probabilities of type I and II, respectively.

The frequentist approach described above does not make use of an explicit loss
function. Lindley notes that: 

Statisticians today commonly do not deny the existence of either utilities
or prior probabilities, they usually say that they are difficult to specify.
However, they often use them (in an unspecified way) in order to
determine significance levels and power. (Lindley 1962, p. 44)

Although based on different concepts, we can see in this example that classical and 
Bayesian optimal sample sizes are the same whenever 


$$
|z_{1-r_0}| = |z_\alpha + z_\beta|/2. \tag{14.12}
$$

Inoue et al. (2005) discuss a general framework for investigating the relationship
between the two approaches, based on identifying mappings that connect the
Bayesian and frequentist inputs necessary to obtain the same sample size, as done
in (14.12) for the mapping between the frequentist inputs $\alpha$ and $\beta$ and the Bayesian
input $r_0$. Their examples illustrate that one can often find correspondences even if the
underlying loss or utility functions are different.
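The correspondence between the two sample sizes can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, $z_\alpha$ and $z_\beta$ are taken to be upper quantiles of the standard normal (a one-sided convention assumed here), and the example inputs are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bayes_goal_n(r0: float, theta0: float, theta1: float) -> float:
    """Continuous n solving 1 - Phi(sqrt(n) (theta1 - theta0) / 2) = r0."""
    z = STD_NORMAL.inv_cdf(1.0 - r0)
    return (2.0 * z / (theta1 - theta0)) ** 2


def frequentist_n(alpha: float, beta: float, theta0: float, theta1: float) -> float:
    """Continuous n = ((z_alpha + z_beta) / (theta1 - theta0))^2.

    Assumption: z_a denotes the upper-a quantile of N(0, 1).
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)
    return ((z_alpha + z_beta) / (theta1 - theta0)) ** 2


def matching_r0(alpha: float, beta: float) -> float:
    """Target Bayes risk r0 for which z_{1-r0} = (z_alpha + z_beta) / 2."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)
    return 1.0 - STD_NORMAL.cdf((z_alpha + z_beta) / 2.0)


if __name__ == "__main__":
    alpha, beta, theta0, theta1 = 0.05, 0.20, 0.0, 0.5  # illustrative inputs
    r0 = matching_r0(alpha, beta)
    print("matching r0:", round(r0, 4))
    print("frequentist n:", frequentist_n(alpha, beta, theta0, theta1))
    print("goal-sampling n:", bayes_goal_n(r0, theta0, theta1))
```

Both functions return a continuous value; in practice one would round up to the next integer. The mapping is only meaningful for $r_0 < 1/2$, where $z_{1-r_0}$ is positive.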

Adcock (1997) formalizes goal-based Bayesian methods as those based on setting

$$
E_{x^n}[T(x^n)] = r_0 \tag{14.13}
$$

and solving for $n$. Here $T(x^n)$ is a test statistic derived from the posterior distribution
on the unknowns of interest. It could be, for instance, the posterior probability of a
given interval, the length of the 95% highest posterior interval, and so forth. $T(x^n)$
plays the role of the posterior expected loss, though direct specification of $T$ bypasses
both the formalization of a decision-theoretic model and the associated computing.
For specific test statistics, software is available (Pham-Gia and Turkkan 1992, Joseph
et al. 1995, and Joseph and Wolfson 1997) to solve sample size problems of the form
(14.13).

Lee and Zelen (2000) observe that even in frequentist settings it can be useful to
set $\alpha$ and $\beta$ so that they result in desirable posterior probabilities of the hypothesis
being true conditional on the study’s result. Their argument relies on a frequentist
analysis of the data, but stresses the importance of (a) using quantities that are
conditional on evidence when making terminal decisions, and (b) choosing $\alpha$ and $\beta$ in
a way that reflects the context to which the results are going to be applied. In the
specifics of clinical trials, they offer the following considerations:

we believe the two fundamental issues are (a) if the trial is positive “What 
is the probability that the therapy is truly beneficial?” and (b) if the trial 
is negative, “What is the probability that the therapies are comparable?” 

The frequentist view ignores these fundamental considerations and can 
result in positive harm because of the use of inappropriate error rates. 

The positive harm arises because an excessive number of false positive 
therapies may be introduced into practice. Many positive trials may be 
unethical to duplicate and, even if replicated, could require many years
to complete. Hence a false positive trial outcome may generate many 
years of patients’ receiving non beneficial therapies. (Lee and Zelen 
2000, p. 96) 

14.2 Computing 

In practice, finding a Bayesian optimal sample size is often too complex for analytic solutions. Here we describe two general computational approaches for Bayesian sample size determination and illustrate them in the context of a relatively simple example. Let $x$ denote the number of successes in a binomial experiment with success probability $\theta$. We want to optimally choose the sample size $n$ of this experiment assuming a priori a mixture of two experts’ opinions as represented by

$$
\pi(\theta) = 0.5\,\mathrm{Beta}(\theta \mid 3, 1) + 0.5\,\mathrm{Beta}(\theta \mid 3, 3).
$$

For the terminal decision problem of estimating $\theta$ assume that the loss function is $L(\theta, a) = |a - \theta|$. Let $\delta^*(x)$ denote the Bayes rule, in this case the posterior median (Problem 7.7). Let the sampling cost be 0.0008 per observation, that is $C(n) = 0.0008n$. The Bayes risk is

$$
r(\pi, n) = \int_0^1 \sum_{x=0}^{n} \left[\,|\delta^*(x) - \theta| + 0.0008n\,\right] \binom{n}{x} \theta^x (1 - \theta)^{n-x}\, \pi(\theta)\, d\theta. \tag{14.14}
$$

This decision problem does not yield an analytical solution. We explore an alternative solution that is based on Monte Carlo simulation (Robert and Casella 1999). A straightforward implementation of Monte Carlo integration to evaluate (14.14) is as follows. Select a grid of plausible values of $n$, and a simulation size $I$.

1. For each $n$, draw $(\theta_i, x_i)$ for $i = 1, \ldots, I$. This can be done by generating $\theta_i$ from the prior and then, conditionally on $\theta_i$, generating $x_i$ from the likelihood.

2. Compute $l_i = L(\theta_i, \delta^*(x_i)) + C(n)$.

3. Compute $\hat{r}(\pi, n) = (1/I) \sum_{i=1}^{I} l_i$.

4. Numerically minimize $\hat{r}$ with respect to $n$.

This is predicated on the availability of a closed form for $\delta^*$. When that is not
the case things become more complicated, but other Monte Carlo schemes are
available.
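The four steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' implementation: the posterior median is located numerically on a grid (not through a closed form), the grid of candidate sample sizes, the simulation size, the seed, and all function names are hypothetical choices.

```python
import random
from functools import lru_cache

COST_PER_OBS = 0.0008
GRID = [(j + 0.5) / 2000 for j in range(2000)]  # midpoints of a fine grid on (0, 1)


def prior_pdf(theta: float) -> float:
    """Density of the mixture 0.5 Beta(3, 1) + 0.5 Beta(3, 3)."""
    return 0.5 * 3.0 * theta**2 + 0.5 * 30.0 * theta**2 * (1.0 - theta) ** 2


@lru_cache(maxsize=None)
def posterior_median(x: int, n: int) -> float:
    """Bayes rule under absolute error loss, found numerically on GRID."""
    weights = [t**x * (1.0 - t) ** (n - x) * prior_pdf(t) for t in GRID]
    half, running = 0.5 * sum(weights), 0.0
    for t, w in zip(GRID, weights):
        running += w
        if running >= half:
            return t
    return GRID[-1]


def draw_theta(rng: random.Random) -> float:
    """Pick one expert with probability 1/2, then draw from that Beta prior."""
    return rng.betavariate(3, 1) if rng.random() < 0.5 else rng.betavariate(3, 3)


def mc_bayes_risk(n: int, n_sims: int, rng: random.Random) -> float:
    """Steps 1 to 3: average of l_i = |delta*(x_i) - theta_i| + C(n)."""
    total = 0.0
    for _ in range(n_sims):
        theta = draw_theta(rng)
        x = sum(rng.random() < theta for _ in range(n))  # binomial(n, theta) draw
        total += abs(posterior_median(x, n) - theta) + COST_PER_OBS * n
    return total / n_sims


if __name__ == "__main__":
    rng = random.Random(2024)
    n_grid = [5, 15, 25, 35, 45, 60, 90]  # hypothetical grid of sample sizes
    risks = {n: mc_bayes_risk(n, 500, rng) for n in n_grid}
    best_n = min(risks, key=risks.get)  # step 4, restricted to the grid
    print(risks)
    print("grid minimizer:", best_n)
```

With a simulation size this small the estimated risks are noisy, so the grid minimizer can change from seed to seed; this is the residual noise that motivates smoothing over $n$.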

In practice the true $r$ can be slowly varying and smooth in $n$, while $\hat{r}$ can be subject to residual noise from step 3, unless $I$ is very large. Müller and Parmigiani (1995) discuss an alternative computational strategy that exploits the smoothness of $r$ by considering a Monte Carlo sample over the whole design space, and then fitting a curve for the Bayes risk $r(\pi, n)$ as a function of $n$. The optimization problem can then be solved in a deterministic way by minimization of the fitted curve. In the context of sample size determination, the procedure can be implemented as follows:

1. Select $I$ points $n_i$, $i = 1, \ldots, I$, possibly including duplications.

2. Draw $(\theta_i, x_i)$ for $i = 1, \ldots, I$. This can be done by generating $\theta_i$ from the prior and then $x_i$ from the likelihood with sample size $n_i$.

3. Compute $l_i = L(\theta_i, \delta^*(x_i)) + C(n_i)$.

4. Fit a smooth curve $\hat{r}(\pi, n)$ to the resulting points $(n_i, l_i)$.

5. Evaluate deterministically the minimum of $\hat{r}(\pi, n)$.

This effectively eliminates the loop over $n$ implied by step 1 of the previous method.
This strategy can also be applied when $L$ is a function of $n$ and when $C$ is a function
of $x$ and $\theta$.

To illustrate we choose the simulation points $n_i$ to be integers chosen uniformly in the range (0, 120). The Monte Carlo sample size is $I = 200$. Figure 14.2(a) shows the simulated pairs $(n_i, l_i)$. We fitted the generated data points by a nonlinear regression of the form

$$
\hat{r}(\pi, n) = 0.0008n + 0.1929\,(1 + bn)^{-a}, \tag{14.15}
$$

where the values of $a$ and $b$ are estimated based on the Monte Carlo sample. This
curve incorporates information regarding (i) the sampling cost; (ii) the value of the
expected payoff when no observations are taken, which can be computed easily and
is 0.1929; and (iii) the rate of convergence of the expected absolute error to zero,
assumed here to be polynomial. The results are illustrated in Figure 14.2(a). Observe
that even with a small Monte Carlo sample there is enough information to make a
satisfactory sample size choice.

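The regression model (14.15) is simple and has the advantage that it yields a closed form solution for the approximate optimal sample size. Setting the derivative of (14.15) with respect to $n$ equal to zero gives

$$
\hat{n}^* = \frac{1}{b}\left[\left(\frac{0.1929\, a\, b}{0.0008}\right)^{1/(a+1)} - 1\right], \tag{14.16}
$$

evaluated at the estimated values of $a$ and $b$. The sample shown gives $\hat{n}^* = 36$. More flexible model specifications may be
preferable in other applications.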
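The curve-fitting strategy can be sketched as follows. This is a rough illustration and not the authors' code: the posterior median is found numerically on a grid, the nonlinear regression is replaced by a crude least-squares search over a coarse grid of $(a, b)$ values, and all function names, grid ranges, and the seed are hypothetical. The closed-form optimum used in the last function comes from setting the derivative of (14.15) in $n$ to zero, which gives $(1 + bn)^{a+1} = 0.1929\,ab/0.0008$.

```python
import random
from functools import lru_cache

C, R0 = 0.0008, 0.1929  # cost per observation; expected loss with no data
GRID = [(j + 0.5) / 1000 for j in range(1000)]


def prior_pdf(t: float) -> float:
    """Density of the mixture 0.5 Beta(3, 1) + 0.5 Beta(3, 3)."""
    return 0.5 * 3.0 * t**2 + 0.5 * 30.0 * t**2 * (1.0 - t) ** 2


@lru_cache(maxsize=None)
def posterior_median(x: int, n: int) -> float:
    """Posterior median located numerically on GRID."""
    weights = [t**x * (1.0 - t) ** (n - x) * prior_pdf(t) for t in GRID]
    half, running = 0.5 * sum(weights), 0.0
    for t, w in zip(GRID, weights):
        running += w
        if running >= half:
            return t
    return GRID[-1]


def simulate_points(n_points: int, n_max: int, rng: random.Random):
    """Steps 1 to 3: one (n_i, l_i) pair per draw, n_i uniform on 0..n_max."""
    points = []
    for _ in range(n_points):
        n = rng.randint(0, n_max)
        theta = rng.betavariate(3, 1) if rng.random() < 0.5 else rng.betavariate(3, 3)
        x = sum(rng.random() < theta for _ in range(n))
        points.append((n, abs(posterior_median(x, n) - theta) + C * n))
    return points


def curve(n: float, a: float, b: float) -> float:
    """Parametric risk curve 0.0008 n + 0.1929 (1 + b n)^(-a)."""
    return C * n + R0 * (1.0 + b * n) ** (-a)


def fit_curve(points):
    """Step 4: least squares over a coarse (a, b) grid (stand-in for regression)."""
    candidates = [(a / 20, b / 20) for a in range(1, 41) for b in range(1, 41)]
    return min(candidates, key=lambda ab: sum((l - curve(n, *ab)) ** 2 for n, l in points))


def closed_form_optimum(a: float, b: float) -> float:
    """Step 5: stationary point of the fitted curve in n."""
    return ((R0 * a * b / C) ** (1.0 / (a + 1.0)) - 1.0) / b


if __name__ == "__main__":
    pts = simulate_points(200, 120, random.Random(7))
    a_hat, b_hat = fit_curve(pts)
    print("fitted (a, b):", (a_hat, b_hat))
    print("approximate optimal n:", closed_form_optimum(a_hat, b_hat))
```

Because the fit pools information across all simulated sample sizes, a few hundred draws in total can be enough, whereas the grid-based method needs that many draws for each candidate $n$.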

In this example, we can also rewrite the risk as

$$
r(\pi, n) = \sum_{x=0}^{n} \left[\int_0^1 \left(|\delta^*(x) - \theta| + 0.0008n\right) \pi_x(\theta)\, d\theta\right] m(x), \tag{14.17}
$$

where $m$ is the marginal distribution of the data and $\pi_x$ is the posterior distribution of
$\theta$, that is a mixture of beta densities. Using the incomplete beta function, the inner
integral can be evaluated directly. We denote the inner integral as $l_i$ when $(n_i, x_i)$ is
selected. We can now interpolate directly the points $(n_i, l_i)$, with substantial reduction
of the variability around the fitted curve. This is clearly recommendable whenever
possible. The resulting estimated curve $\hat{r}(\pi, n)$ is shown in Figure 14.2(b). The new
estimate is $\hat{n}^* = 35$, not far from the previous approach, which performs quite well
in this case.

Figure 14.2 Monte Carlo estimation of the Bayes risk curve. The points in panel
(a) have coordinates $(n_i, l_i)$. The lines show $\hat{r}(\pi, n)$ using 200 replicates (dotted) and
2000 replicates (solid). The points in panel (b) have coordinates $(n_i, l_i)$, with $l_i$ the inner integral of (14.17). The dashed
line shows $\hat{r}(\pi, n)$.

Figure 14.3 Estimation of the Bayes risk curve under quadratic error loss. Small
points are the sampled values of $(\delta^*(x) - \theta)^2 + cn$, where $\delta^*(x)$ denotes the posterior
mean. Large points are the posterior variances. The solid line is the fit to the
posterior variances.


We end this section by considering two alternative decision problems. The first
is an estimation problem with squared error loss $L(\theta, a) = (\theta - a)^2$. Here the optimal
terminal decision rule is the posterior mean and the posterior expected loss is the
posterior variance (see Section 7.6.1). Both of these can be evaluated easily, which
allows us to validate the Monte Carlo approach. Figure 14.3 illustrates the results.
The regression model is that of equation (14.15).

The last problem is the estimation of the expected information $V(\mathcal{E}^{(n)})$. We can proceed as before by defining

$$
v_i = \log\left(\frac{\pi(\theta_i \mid x_i)}{\pi(\theta_i)}\right) \tag{14.18}
$$

and using curve fitting to estimate the expectation of $v_i$ as a function of $n$. In this
example we choose to use a loess (Cleveland and Grosse 1991) fit for the simulated
points. Figure 14.4 illustrates various aspects of the simulation. The top panel nicely
illustrates the decreasing marginal returns of experimentation implied by Theorem
13.8. The middle panel shows the change in utility as a function of the prior, and
illustrates how the most useful experiments are those generated by parameter values
that have low a priori probability, as these can radically change one’s state of
knowledge.

Figure 14.4 Estimation of the expected information curve. Panel (a) shows the
difference between log posterior and log prior for a sample of $\theta, x, n$. Panel (b)
shows the difference between log posterior and log prior versus the log prior. For
parameters corresponding to higher values of the log prior, the experiment is less
informative, as expected. The points are a low-dimensional projection of what is
close to a three-dimensional surface. Panel (c) shows the log of the posterior versus
the sample size and number of successes $x$. The only source of variability is the
value of $\theta$.


14.3 Examples 

In this section we discuss in detail five different examples. The first four have analytic 
solutions while the last one makes use of the computational approach of Section 14.2. 


14.3.1 Point estimation with quadratic loss 

We begin with a simple estimation problem in which preposterior calculations are 
easy, the solution is in closed form, and it provides a somewhat surprising insight. 

Assume that $x_1, \ldots, x_n$ are conditionally independent $N(\theta, \sigma^2)$, with $\sigma^2$ known. The terminal decision is to estimate $\theta$ with quadratic loss function $L(\theta, a, n) = (\theta - a)^2 + C(n)$. The prior distribution on $\theta$ is $N(\mu_0, \tau_0^2)$, with $\mu_0$ and $\tau_0^2$ known.

Since $\bar{x}_n \sim N(\theta, \sigma^2/n)$, the posterior distribution of $\theta$ given the observed sample mean $\bar{x}_n$ is $N(\mu_x, \tau_x^2)$ where, as in Appendix A.4,

$$
\mu_x = \frac{\sigma^2}{\sigma^2 + n\tau_0^2}\,\mu_0 + \frac{n\tau_0^2}{\sigma^2 + n\tau_0^2}\,\bar{x}_n
\quad\text{and}\quad
\tau_x^2 = \frac{\sigma^2 \tau_0^2}{\sigma^2 + n\tau_0^2}.
$$

The posterior expected loss, not including the cost of observation, is $\tau_x^2$. Because this depends on the data only via the sample size $n$ but is independent of $\bar{x}_n$, we have

$$
r(\pi, n) = \frac{\sigma^2 \tau_0^2}{\sigma^2 + n\tau_0^2} + C(n). \tag{14.19}
$$

If each additional observation costs a fixed amount $c$, that is $C(n) = cn$, replacing $n$ with a continuous variable and taking derivatives we get that the approximate optimal sample size is

$$
n^* = \frac{\sigma}{\sqrt{c}} - \frac{\sigma^2}{\tau_0^2}. \tag{14.20}
$$

Considering the two integer values next to $n^*$ and choosing the one with the smallest
Bayes risk gives the optimal solution, because this function is convex. Figure 14.5
shows this optimization problem when $\tau_0^2 \to \infty$, in which case the optimal sample
size is $n^* = \sigma/\sqrt{c}$.

Figure 14.5 Risk functions and components in the optimal sample size problem of
Section 14.3.1, when $\tau_0^2$ is large compared to $\sigma^2$.

In equation (14.20), the optimal sample size is a function of two ratios: the ratio
of the experimental standard deviation to the square root of the cost, which characterizes the cost-effectiveness of
a single sample point; and the ratio of experimental variance to prior variance, which
characterizes the incremental value of each observation in terms of knowledge. As
a result, the optimal sample size increases with the prior variance, because less is
known a priori, and decreases with cost. However, $n^*$ is not a monotone function of
the sampling variance $\sigma^2$. When $\sigma^2$ is very small few observations suffice; when it
is large, observations lose cost-effectiveness and a smaller overall size is optimal.
The largest sample sizes obtain for intermediate values of $\sigma^2$. In particular $n^*$, as a
function of $\sigma$, is maximized at $\sigma = \tau_0^2/(2\sqrt{c})$.
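A short numerical sketch makes the non-monotonicity concrete. It uses the Bayes risk $\sigma^2\tau_0^2/(\sigma^2 + n\tau_0^2) + cn$ and its continuous stationary point $\sigma/\sqrt{c} - \sigma^2/\tau_0^2$; the function names and the numerical inputs below are illustrative assumptions, not values from the text.

```python
from math import ceil, floor, sqrt


def bayes_risk(n: int, sigma: float, tau0: float, c: float) -> float:
    """Posterior variance plus linear sampling cost c * n."""
    return sigma**2 * tau0**2 / (sigma**2 + n * tau0**2) + c * n


def optimal_n(sigma: float, tau0: float, c: float) -> int:
    """Integer minimizer: compare the two integers next to the continuous solution."""
    n_cont = max(0.0, sigma / sqrt(c) - sigma**2 / tau0**2)
    candidates = {floor(n_cont), ceil(n_cont)}
    return min(candidates, key=lambda n: bayes_risk(n, sigma, tau0, c))


if __name__ == "__main__":
    tau0, c = 1.0, 0.0001  # hypothetical prior sd and cost per observation
    for sigma in (1.0, 10.0, 50.0, 90.0):
        print(sigma, optimal_n(sigma, tau0, c))
    # The maximizing sigma is tau0^2 / (2 sqrt(c)), which is 50 for these inputs.
```

Checking only the two neighboring integers is justified by the convexity of the risk in $n$; the `max(0.0, ...)` guard covers the case where sampling is never worthwhile.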

14.3.2 Composite hypothesis testing 

Assume again that $x_1, \ldots, x_n$ is a sample from a $N(\theta, \sigma^2)$ population with $\sigma^2$ known, and that we are now interested in testing $H_0: \theta \le 0$ versus $H_1: \theta > 0$. As usual, let $a_i$ denote the decision of accepting hypothesis $H_i$, $i = 0, 1$. While we generally study loss functions such as that of Table 14.1, a useful alternative for this decision problem is

$$
L(\theta, a_0) = \begin{cases} 0 & \theta \le 0 \\ \theta & \theta > 0 \end{cases}
\quad\text{and}\quad
L(\theta, a_1) = \begin{cases} -\theta & \theta \le 0 \\ 0 & \theta > 0. \end{cases}
$$

This loss function reflects both whether the decision is correct and whether the actual $\theta$ is far from 0 if the decision is incorrect. The loss specification is completed by setting $L(\theta, a, n) = L(\theta, a) + C(n)$. Assume that the prior density of $\theta$ is $N(\mu_0, \tau_0^2)$ with $\mu_0 = 0$. To determine the optimal sample size we first determine the optimal $\delta^*(x^n)$ by minimizing the posterior expected loss $\mathcal{L}$. Then we perform the preposterior analysis by evaluating $\mathcal{L}$ at the optimal $\delta^*(x^n)$ and averaging the result with respect to the marginal distribution of the data.

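The sample mean $\bar{x}_n$ is sufficient for $x^n$, so $\pi(\theta \mid x^n) = \pi(\theta \mid \bar{x}_n)$ and we can handle all calculations by simply considering $\bar{x}_n$. The posterior expected losses for $a_0$ and $a_1$ are, respectively,

$$
\int_0^{\infty} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta
\quad\text{and}\quad
-\int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta,
$$

so that the Bayes rule is $a_0$ whenever

$$
\int_0^{\infty} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta < -\int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta
$$

and $a_1$, otherwise. This inequality can be rewritten in terms of the posterior mean: we choose $a_0$ whenever

$$
\int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta + \int_0^{\infty} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta = E[\theta \mid \bar{x}_n] < 0.
$$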

In this example

$$
E[\theta \mid \bar{x}_n] = \bar{x}_n\, \frac{n\tau_0^2}{\sigma^2 + n\tau_0^2}
$$

because $\mu_0 = 0$, so we decide in favor of $a_0$ ($a_1$) if $\bar{x}_n < (>)\, 0$. In other words, the Bayes rule is

$$
\delta^*_n(\bar{x}_n) = \begin{cases} a_0 & \bar{x}_n \le 0 \\ a_1 & \bar{x}_n > 0. \end{cases}
$$

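This defines the terminal decision rule, that is the optimal strategy after the data have been observed, as a function of the sample size and sufficient statistic. Now we need to address the decision concerning the sample size. The posterior expected loss associated with the terminal decision is

$$
\begin{cases}
\int_0^{\infty} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta & \bar{x}_n \le 0 \\
-\int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta & \bar{x}_n > 0.
\end{cases}
$$

The Bayes risk for the sample size decision is the cost of experimentation plus the expectation of the expression above with respect to the marginal distribution of $\bar{x}_n$. Specifically,

$$
r(\pi, n) = \int_{-\infty}^{0} \int_0^{\infty} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta\; m(\bar{x}_n)\, d\bar{x}_n
- \int_0^{\infty} \int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, d\theta\; m(\bar{x}_n)\, d\bar{x}_n + C(n).
$$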

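By adding and subtracting $\int_{-\infty}^{0} \int_{-\infty}^{0} \theta\, \pi(\theta \mid \bar{x}_n)\, m(\bar{x}_n)\, d\theta\, d\bar{x}_n$ and changing the order of integration in the first term we get

$$
\begin{aligned}
r(\pi, n) &= \int_{-\infty}^{0} m(\bar{x}_n)\, E[\theta \mid \bar{x}_n]\, d\bar{x}_n - \int_{-\infty}^{0} \theta\, \pi(\theta)\, d\theta + C(n) \\
&= \frac{n\tau_0^2}{\sigma^2 + n\tau_0^2} \int_{-\infty}^{0} \bar{x}_n\, m(\bar{x}_n)\, d\bar{x}_n - \int_{-\infty}^{0} \theta\, \pi(\theta)\, d\theta + C(n).
\end{aligned}
$$

Because $\theta \sim N(0, \tau_0^2)$ then $\int_{-\infty}^{0} \theta\, \pi(\theta)\, d\theta = -\tau_0/\sqrt{2\pi}$. Also, the marginal distribution of the sample mean $\bar{x}_n$ is $N(0, (\tau_0^2 n + \sigma^2)/n)$. Thus,

$$
\int_{-\infty}^{0} \bar{x}_n\, m(\bar{x}_n)\, d\bar{x}_n = -\frac{1}{\sqrt{2\pi}} \sqrt{\frac{\tau_0^2 n + \sigma^2}{n}}
$$

and

$$
r(\pi, n) = \frac{\tau_0}{\sqrt{2\pi}} \left[1 - \tau_0 \sqrt{\frac{n}{\tau_0^2 n + \sigma^2}}\,\right] + C(n).
$$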

Figure 14.6 The Bayes risk for the composite hypothesis testing problem.

In particular, if $\sigma^2 = 1$, $\tau_0^2 = 4$, and the cost of a fixed sample of size $n$ is $C(n) = 0.01n$ then

$$
r(\pi, n) = \frac{2}{\sqrt{2\pi}} \left[1 - 2\sqrt{\frac{n}{4n + 1}}\,\right] + 0.01n.
$$

Figure 14.6 shows $r(\pi, n)$ as a function of $n$ taken to vary continuously. Among
integer values of $n$ the optimal sample size turns out to be 3.
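The integer optimum can be confirmed by direct evaluation. The sketch below assumes the closed form $r(\pi, n) = (\tau_0/\sqrt{2\pi})\,[1 - \tau_0\sqrt{n/(\tau_0^2 n + \sigma^2)}] + cn$ that results from the derivation above; the function name and the search range of 0 to 50 are arbitrary choices made for illustration.

```python
from math import pi, sqrt


def bayes_risk(n: int, sigma2: float = 1.0, tau02: float = 4.0, cost: float = 0.01) -> float:
    """Bayes risk of the composite testing problem with linear sampling cost."""
    tau0 = sqrt(tau02)
    shrink = tau0 * sqrt(n / (tau02 * n + sigma2))
    return tau0 / sqrt(2.0 * pi) * (1.0 - shrink) + cost * n


if __name__ == "__main__":
    for n in range(0, 7):
        print(n, round(bayes_risk(n), 4))
    best_n = min(range(0, 51), key=bayes_risk)
    print("integer minimizer:", best_n)
```

At $n = 0$ the function returns $\tau_0/\sqrt{2\pi}$, the expected loss of deciding on the prior alone, which is a convenient sanity check on the formula.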

14.3.3 A two-action problem with linear utility 

This example is based on Raiffa and Schlaifer (1961). Consider a health care agency, say the WHO, facing the choice between two actions $a_0$ and $a_1$, representing, for example, two alternative prevention programs. Both programs have a setup cost for implementation $K_i$, $i = 0, 1$, and will benefit a number $k_i$ of individuals. Each individual in either program can expect a benefit of $x$ additional units of utility, for example years of life translated into monetary terms. In the population, the distribution of $x$ is normal with unknown mean $\theta$ and known variance $\sigma^2$. The utility to the WHO of implementing each of the two programs is

$$
u(a_i(\theta)) = K_i + k_i \theta, \quad \text{for } i = 0, 1 \text{ with } k_0 < k_1. \tag{14.21}
$$

We will assume that the WHO has the possibility of gathering a sample $x^n$ from the population, to assist in the choice of a program, and we are going to study the optimal sample size for this problem. We are going to take a longer route than strictly necessary and first work out in detail both the expected value of perfect information and the expected value of experimental information for varying $n$. We are then going to optimize the latter assuming that the expected cost of sampling is $C(n) = cn$.

Throughout, we will assume that $\theta$ has a priori normal distribution with mean $\mu_0$ and variance $\tau_0^2 = \sigma^2/n_0$. Expressing $\tau_0^2$ in terms of the population variance will simplify analytical expressions.

Let $\theta_b$ be the “break-even value,” that is the value of $\theta$ under which the two programs have the same utility. Solving $u(a_0(\theta_b)) = u(a_1(\theta_b))$ leads to

$$
\theta_b = \frac{K_0 - K_1}{k_1 - k_0}.
$$

For a given value of $\theta$, the optimal action $a_\theta$ is such that

$$
a_\theta = \begin{cases} a_0 & \text{if } \theta \le \theta_b \\ a_1 & \text{if } \theta > \theta_b. \end{cases} \tag{14.22}
$$

As $\theta$ is unknown, $a_\theta$ is not implementable.

Before experimentation, the expected utility of action $a_i$ is $E[u(a_i(\theta))] = K_i + k_i \mu_0$. Thus, the optimal decision is

$$
a^* = \begin{cases} a_0 & \text{if } \mu_0 \le \theta_b \\ a_1 & \text{if } \mu_0 > \theta_b. \end{cases} \tag{14.23}
$$

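As seen in Chapter 13, the conditional value of perfect information for a given value $\theta$ is

$$
V_\theta(\mathcal{E}^\infty) = u(a_\theta(\theta)) - u(a^*(\theta)). \tag{14.24}
$$

Using the linear form of the utility function and the fact that

$$
(K_1 - K_0) = -(k_1 - k_0)\,\theta_b, \tag{14.25}
$$

we obtain

$$
V_\theta(\mathcal{E}^\infty) = \begin{cases}
u(a_0(\theta)) - u(a_1(\theta)) = (k_1 - k_0)(\theta_b - \theta), & \text{if } \theta \le \theta_b,\ \mu_0 > \theta_b \\
u(a_1(\theta)) - u(a_0(\theta)) = (k_1 - k_0)(\theta - \theta_b), & \text{if } \theta > \theta_b,\ \mu_0 \le \theta_b \\
0, & \text{otherwise.}
\end{cases} \tag{14.26}
$$

In Chapter 13, we saw that the expected value of perfect information is

$$
V(\mathcal{E}^\infty) = E_\theta[V_\theta(\mathcal{E}^\infty)]. \tag{14.27}
$$

It follows from (14.26) that

$$
V(\mathcal{E}^\infty) = \begin{cases}
(k_1 - k_0) \int_{\theta_b}^{\infty} (\theta - \theta_b)\, \pi(\theta)\, d\theta, & \text{if } \mu_0 \le \theta_b \\
(k_1 - k_0) \int_{-\infty}^{\theta_b} (\theta_b - \theta)\, \pi(\theta)\, d\theta, & \text{if } \mu_0 > \theta_b.
\end{cases}
$$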

We define the following integrals:

$$
A(\theta_b) = \int_{\theta_b}^{\infty} (\theta - \theta_b)\, \pi(\theta)\, d\theta \tag{14.28}
$$

$$
B(\theta_b) = \int_{-\infty}^{\theta_b} (\theta_b - \theta)\, \pi(\theta)\, d\theta. \tag{14.29}
$$

Using the results of Problem 14.5 we obtain

$$
V(\mathcal{E}^\infty) = (k_1 - k_0)\, \zeta(z)\, \sigma/\sqrt{n_0}, \tag{14.30}
$$

where $z = (\sqrt{n_0}/\sigma)\,|\theta_b - \mu_0|$, $\zeta(z) = \phi(z) - z(1 - \Phi(z))$, where $\phi$ and $\Phi$ denote,
respectively, the density and the cdf of the standard normal distribution.
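As a check on the closed form, the expected value of perfect information $(k_1 - k_0)\,\zeta(z)\,\sigma/\sqrt{n_0}$, with $\zeta(z) = \phi(z) - z(1 - \Phi(z))$ and $z = \sqrt{n_0}\,|\theta_b - \mu_0|/\sigma$, can be compared with a direct numerical integration of the gain over the $N(\mu_0, \sigma^2/n_0)$ prior. The sketch below is illustrative only; the function names and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def zeta(z: float) -> float:
    """zeta(z) = phi(z) - z * (1 - Phi(z)) for the standard normal."""
    return STD_NORMAL.pdf(z) - z * (1.0 - STD_NORMAL.cdf(z))


def evpi(k0, k1, theta_b, mu0, sigma, n0) -> float:
    """Closed-form expected value of perfect information."""
    z = sqrt(n0) * abs(theta_b - mu0) / sigma
    return (k1 - k0) * zeta(z) * sigma / sqrt(n0)


def evpi_numeric(k0, k1, theta_b, mu0, sigma, n0, steps: int = 20_000) -> float:
    """Midpoint-rule integral of the positive part of the gain under the prior."""
    prior = NormalDist(mu0, sigma / sqrt(n0))
    lo, hi = mu0 - 10.0 * prior.stdev, mu0 + 10.0 * prior.stdev
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        theta = lo + (j + 0.5) * h
        # Gain from switching away from the prior-optimal action.
        gain = (theta - theta_b) if mu0 <= theta_b else (theta_b - theta)
        if gain > 0.0:
            total += gain * prior.pdf(theta) * h
    return (k1 - k0) * total


if __name__ == "__main__":
    args = dict(k0=1.0, k1=3.0, theta_b=1.0, mu0=0.5, sigma=2.0, n0=4.0)
    print(evpi(**args), evpi_numeric(**args))
```

The two values should agree to several decimal places; the closed form is symmetric in the sign of $\theta_b - \mu_0$, which is why the same $\zeta$ covers both cases.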

We now turn our attention to the situation in which the decision maker performs an experiment $\mathcal{E}^{(n)}$, observes a sample $x^n = (x_1, \ldots, x_n)$, and then chooses between $a_0$ and $a_1$. The posterior distribution of $\theta$ given $x^n$ is normal with posterior mean $\mu_x = (n_0 \mu_0 + n\bar{x}_n)/(n_0 + n)$ and posterior variance $\tau_x^2 = \sigma^2/(n_0 + n)$. Thus, the optimal “a posteriori” decision is

$$
a^*_x = \begin{cases} a_0 & \text{if } \mu_x \le \theta_b \\ a_1 & \text{if } \mu_x > \theta_b. \end{cases} \tag{14.31}
$$

Recalling Chapter 13, the value of performing the experiment $\mathcal{E}^{(n)}$ and observing
$x^n$ as compared to no experimentation is given by the conditional value of the sample
information, or observed information, $V_{x^n}(\mathcal{E}^{(n)})$. The expectation of it taken with
respect to the marginal distribution of $x^n$ is the expected value of sample information,
or expected information.

In this problem, because the terminal utility function is linear in $\theta$, the observed information is functionally similar to the conditional value of perfect information; that is, the observed information is

$$
V_{x^n}(\mathcal{E}^{(n)}) = V(\pi_x) - U_{\pi_x}(a^*). \tag{14.32}
$$

One can check that the expected information is $V(\mathcal{E}^{(n)}) = E_{\mu_x}(V_{x^n}(\mathcal{E}^{(n)}))$. This expectation is taken with respect to the distribution of $\mu_x$, that is a normal with mean $\mu_0$ and variance $\sigma^2/n_x$ where

$$
\frac{1}{n_x} = \frac{1}{n_0} - \frac{1}{n_0 + n}.
$$

Following steps similar to those used in the calculation of (14.30), it can be shown that the expected information is

$$
V(\mathcal{E}^{(n)}) = \begin{cases}
(k_1 - k_0)\, A_{\mu_x}(\theta_b) & \text{if } \mu_0 \le \theta_b \\
(k_1 - k_0)\, B_{\mu_x}(\theta_b) & \text{if } \mu_0 > \theta_b
\end{cases}
= (k_1 - k_0)\, \zeta(z)\, \sigma/\sqrt{n_x}, \tag{14.33}
$$

where $z = |\theta_b - \mu_0| \sqrt{n_x}/\sigma$.

The expected net gain of an experiment that consists of observing a sample of size $n$ is therefore

$$
U_\pi(n) = V(\mathcal{E}^{(n)}) - cn = (k_1 - k_0)\, \zeta(z)\, \sigma/\sqrt{n_x} - cn. \tag{14.34}
$$

Let $d = (k_1 - k_0)\sigma/(c\sqrt{n_0^3})$, $q = n/n_0$, and $z_0 = |\theta_b - \mu_0| \sqrt{n_0}/\sigma$. The “dimensionless net gain” is defined by Raiffa and Schlaifer (1961) as

$$
g(d, q, z_0) = \frac{U_\pi(n)}{c n_0} = d \sqrt{\frac{q}{q+1}}\; \zeta\!\left(z_0 \sqrt{\frac{q+1}{q}}\right) - q. \tag{14.35}
$$

To solve the sample size problem we can equivalently maximize $g$ with respect to $q$ and then translate the result in terms of $n$. Assuming $q$ to be a continuous variable and taking the first- and second-order derivatives of $g$ with respect to $q$ gives

$$
\begin{aligned}
g'(d, q, z_0) &= \tfrac{1}{2}\, d\, q^{-1/2} (q+1)^{-3/2}\, \phi\!\left(z_0 \sqrt{\tfrac{q+1}{q}}\right) - 1 \\
g''(d, q, z_0) &= \tfrac{1}{4}\, d\, [q(q+1)]^{-5/2}\, \phi\!\left(z_0 \sqrt{\tfrac{q+1}{q}}\right) \left[z_0^2 + (z_0^2 - 1)\, q - 4q^2\right].
\end{aligned} \tag{14.36}
$$

The first-order condition for a local maximum is that $g'(d, q, z_0) = 0$. The function $g$ always admits a local maximum when $z_0 = 0$. However, when $z_0 > 0$ a local maximum may or may not exist (see Problem 14.6).

By solving the equation $g'(d, q, z_0) = 0$ we obtain

$$
q^{1/2} (q+1)^{3/2}\, \underbrace{e^{z_0^2/(2q)}}_{\text{approaches 1 for large } q} = \frac{d}{2}\, \phi(z_0).
$$

The large-$q$ approximate solution is then $q^* = \sqrt{\phi(z_0)\, d/2}$. The implied sample size is

$$
n^* = n_0 \sqrt{\phi(z_0)\, d/2} = \sqrt{\phi(z_0)\,(k_1 - k_0)\, c^{-1} \sigma\, n_0^{1/2}/2}. \tag{14.37}
$$
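The quality of the large-$q$ approximation can be examined numerically. The sketch below writes the dimensionless net gain as $g(d, q, z_0) = d\sqrt{q/(q+1)}\,\zeta(z_0\sqrt{(q+1)/q}) - q$ with $\zeta(z) = \phi(z) - z(1 - \Phi(z))$, and its derivative as $g' = \tfrac{1}{2} d\, q^{-1/2}(q+1)^{-3/2}\phi(z_0\sqrt{(q+1)/q}) - 1$, then solves $g' = 0$ by bisection. The bracket around the approximate solution, the function names, and the inputs are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def zeta(z: float) -> float:
    return STD_NORMAL.pdf(z) - z * (1.0 - STD_NORMAL.cdf(z))


def net_gain(d: float, q: float, z0: float) -> float:
    """Dimensionless net gain g(d, q, z0), for q > 0."""
    s = sqrt(q / (q + 1.0))
    return d * s * zeta(z0 / s) - q


def net_gain_slope(d: float, q: float, z0: float) -> float:
    """First derivative of g with respect to q."""
    z = z0 * sqrt((q + 1.0) / q)
    return 0.5 * d * q ** -0.5 * (q + 1.0) ** -1.5 * STD_NORMAL.pdf(z) - 1.0


def q_large_approx(d: float, z0: float) -> float:
    """Large-q approximation sqrt(phi(z0) d / 2)."""
    return sqrt(STD_NORMAL.pdf(z0) * d / 2.0)


def q_stationary(d: float, z0: float, iters: int = 200) -> float:
    """Bisection on g' in a bracket around the approximation (assumed to hold the root)."""
    lo, hi = 0.5 * q_large_approx(d, z0), 2.0 * q_large_approx(d, z0)
    if not (net_gain_slope(d, lo, z0) > 0.0 > net_gain_slope(d, hi, z0)):
        raise ValueError("g' does not change sign in the search bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if net_gain_slope(d, mid, z0) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    d, z0 = 1e5, 0.5  # hypothetical inputs giving a large optimal q
    print(q_large_approx(d, z0), q_stationary(d, z0))
```

For small $d$ or large $z_0$ the bracket may contain no sign change, which is consistent with the remark that a local maximum need not exist when $z_0 > 0$; the function raises an error in that case rather than returning a misleading value.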


14.3.4 Lindley information for exponential data 

As seen in Chapter 13, if the data to be collected are exchangeable, then the expected information $\mathcal{I}(\mathcal{E}^{(n)})$ is increasing and concave in $n$. If, in addition to this, the cost of observation is also an increasing function in $n$, and information and cost are additive, then there are at most two (adjacent) solutions. We illustrate this using the following example taken from Parmigiani and Berry (1994).

Consider exponential survival data with unknown failure rate $\theta$ and conjugate prior Gamma$(\alpha_0, \beta_0)$. Here $\alpha_0$ can be interpreted as the number of events in a hypothetical prior study, while $\beta_0$ can be interpreted as the total time at risk in that study. Let $t = \sum_{i=1}^{n} x_i$ be the total survival time. The expected information in a sample of size $n$ is

$$
\mathcal{I}(\mathcal{E}^{(n)}) = \int_0^{\infty}\!\!\int_0^{\infty} \log\left[\frac{\Gamma(\alpha_0)\,(\beta_0 + t)^{\alpha_0 + n}}{\Gamma(\alpha_0 + n)\, \beta_0^{\alpha_0}}\; \theta^n e^{-\theta t}\right] \times \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)\Gamma(n)}\, \theta^{\alpha_0 + n - 1}\, t^{n-1}\, e^{-\theta(\beta_0 + t)}\, dt\, d\theta. \tag{14.38}
$$

This integration is a bit tedious but can be worked out in terms of the gamma function $\Gamma$ and the digamma function $\psi$ as

$$
\mathcal{I}(\mathcal{E}^{(n)}) = \log\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + n)} + (\alpha_0 + n)\,\psi(\alpha_0 + n) - \alpha_0\, \psi(\alpha_0) - n. \tag{14.39}
$$

The result depends on the number of prior events $\alpha_0$ but not on the prior total time at risk $\beta_0$.

From (14.38), the expected information for one observation is

$$
\mathcal{I}(\mathcal{E}^{(1)}) = \frac{1}{\alpha_0} + \psi(\alpha_0) - \log \alpha_0.
$$

In particular, if $\alpha_0 = 1$, then $\mathcal{I}(\mathcal{E}^{(1)}) = 1 - C = 0.4228$, where $C$ is Euler’s constant. Using a first-order Stirling approximation we get

$$
\log \Gamma(n) \approx \frac{1}{2} \log 2\pi - n + \left(n - \frac{1}{2}\right) \log n
$$

and

$$
\psi(n) \approx \log n - \frac{1}{2n}.
$$

Applying these to the expression of the expected information gives

$$
\mathcal{I}(\mathcal{E}^{(n)}) \approx K(\alpha_0) + \frac{1}{2} \log(\alpha_0 + n)
$$

where $K(\alpha_0) = \log \Gamma(\alpha_0) - \alpha_0(\psi(\alpha_0) - 1) - \frac{1}{2}\log 2\pi - \frac{1}{2}$.

Using this approximation and taking $n$ to be continuous, when the sampling cost per observation is $c$ (in information units), the optimal fixed sample size is

$$
n^* = \frac{1}{2c} - \alpha_0.
$$

The optimal solution depends inversely on the cost and linearly on the prior sample size.
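The exact expression and the Stirling-based approximation can be compared directly. In the sketch below the exact expected information is taken to be $\log[\Gamma(\alpha_0)/\Gamma(\alpha_0+n)] + (\alpha_0+n)\psi(\alpha_0+n) - \alpha_0\psi(\alpha_0) - n$, and the approximation to be $K(\alpha_0) + \tfrac{1}{2}\log(\alpha_0 + n)$. Python's standard library has no digamma function, so a simple recurrence-plus-asymptotic-series version is included; it and all function names and inputs are assumptions made for this illustration.

```python
from math import lgamma, log, pi


def digamma(x: float) -> float:
    """psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + log(x) - 0.5 / x - series


def expected_information(n: int, alpha0: float) -> float:
    """Exact expected (Lindley) information for n exponential observations."""
    return (lgamma(alpha0) - lgamma(alpha0 + n)
            + (alpha0 + n) * digamma(alpha0 + n)
            - alpha0 * digamma(alpha0) - n)


def approx_information(n: int, alpha0: float) -> float:
    """First-order Stirling approximation K(alpha0) + 0.5 log(alpha0 + n)."""
    k = lgamma(alpha0) - alpha0 * (digamma(alpha0) - 1.0) - 0.5 * log(2.0 * pi) - 0.5
    return k + 0.5 * log(alpha0 + n)


def optimal_n(alpha0: float, c: float, n_max: int = 2000) -> int:
    """Integer maximizer of information minus linear cost, by direct search."""
    return max(range(n_max + 1), key=lambda n: expected_information(n, alpha0) - c * n)


if __name__ == "__main__":
    alpha0, c = 2.0, 0.01  # hypothetical prior events and cost per observation
    print("one observation, alpha0 = 1:", expected_information(1, 1.0))
    print("exact optimum:", optimal_n(alpha0, c))
    print("approximate optimum:", 1.0 / (2.0 * c) - alpha0)
```

The direct search is affordable here because each evaluation is cheap; it also provides a check on how far the continuous approximation $1/(2c) - \alpha_0$ is from the integer optimum.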

14.3.5 Multicenter clinical trials

In a multicenter clinical trial, the same treatment is carried out on different populations
of patients at different institutions or locations (centers). The goals of
multicenter trials are to accrue patients faster than would be possible at a single
center, and to collect evidence that is more generalizable, because it is less prone to
center-specific factors that may affect the conclusions. Multicenter clinical trials are
usually more expensive and difficult to perform than single-center trials. From the
standpoint of sample size determination, two related questions arise: the appropriate
number of centers and the sample size in each center. In this section we illustrate
a decision-theoretic approach to the joint determination of these two sample sizes
using Bayesian hierarchical modeling. Our discussion follows Parmigiani and Berry
(1994).

Consider a simple situation in which each individual in the trial receives an experimental
treatment for which there is no current alternative, and we record a binary
outcome, representing successful recovery. Within each center $i$, we will assume that
patients are exchangeable with an unknown success probability $\theta_i$ that is allowed to
vary from center to center as a result of differences in the local populations, in the
modalities of implementation of care, and in other factors. In a hierarchical model
(Stangl 1995, Gelman et al. 1995), the variability of the center-specific parameters
$\theta_i$ is described by a further probability distribution, which we assume here to be a
Beta$(\alpha, \beta)$ where $\alpha$ and $\beta$ are now unknown. The $\theta_i$ are conditionally independent
draws from this distribution, implying that we have no a priori reasons to suppose that
any of the centers may have a higher propensity to success. The problem specification
is completed by a prior distribution $\pi(\alpha, \beta)$ on $(\alpha, \beta)$.

To proceed, we assume that the number of patients in each center is the same, that
is $n$. Let $k$ denote the number of centers in the study. Because the treatment would
eventually be applied to the population at large, our objective will be to learn about
the general population of centers rather than about any specific center. This learning
is captured by the posterior distribution of $\alpha$ and $\beta$. Thus we will optimally design
the pair $(n, k)$ to maximize the expected information $\mathcal{I}$ on $\alpha$ and $\beta$. Alternatively, one
could include explicit consideration of one or more specific centers. See for example,
Mukhopadhyay and Stangl (1993).

Let $x$ be a $k$-dimensional vector representing the number of successes in each center. The marginal distribution of the data is

$$
m(x) = \int_0^{\infty}\!\!\int_0^{\infty} \pi(\alpha, \beta)\, \frac{\prod_{i=1}^{k} \binom{n}{x_i} B(\alpha + x_i, \beta + n - x_i)}{B^k(\alpha, \beta)}\, d\alpha\, d\beta
$$

and

$$
\pi(\alpha, \beta \mid x) = \frac{\pi(\alpha, \beta) \prod_{i=1}^{k} B(\alpha + x_i, \beta + n - x_i) \binom{n}{x_i}}{m(x)\, B^k(\alpha, \beta)}
$$

where $B$ is the beta function.

The following results consider a prior distribution on $\alpha$ and $\beta$ given by
independent and identically distributed gamma distributions with shape 6 and
rate 2, corresponding to relatively little initial knowledge about the population
of centers. This prior distribution is graphed in Figure 14.7(a). The implied
prior predictive distribution on the probability of success in a hypothetical new
center $k + 1$, randomly drawn from the population of centers, is shown in
Figure 14.7(b).

Figure 14.8 shows the posterior distributions obtained by three different allocations
of 16 patients: all in one center ($n = 16$, $k = 1$), four in each of four centers
($n = 4$, $k = 4$), and all in different centers ($n = 1$, $k = 16$). All three posteriors are
based on a total of 12 observed successes. As centers are exchangeable, the posterior
distribution does not depend on the order of the elements of $x$. So for $k = 1$ and
$k = 16$ the posterior graphed is the only one arising from 12 successes. For $k = 4$ the
posterior shown is one of six possible.

Higher values of the number of centers $k$ correspond to more concentrated posterior
distributions; that is, to a sharper knowledge of the population of centers. A
different way of illustrating the same point is to graph the marginal predictive distribution
of the success probability $\theta_{k+1}$ in a hypothetical future center, for the three
scenarios considered. This is shown in Figure 14.9.
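The three posteriors can be explored numerically by normalizing the product of prior and likelihood on a grid of $(\alpha, \beta)$ values. The sketch below is an illustration only: it assumes the beta-binomial likelihood of the hierarchical model with independent Gamma(shape 6, rate 2) priors, truncates the grid at 12 in each direction, and summarizes the posterior through the mean and standard deviation of $\alpha/(\alpha + \beta)$, which is a summary chosen here and not one used in the text. All function names are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_kernel(alpha: float, beta: float, successes, n: int) -> float:
    """Log of prior times likelihood for (alpha, beta), up to an additive constant."""
    # Independent Gamma(shape 6, rate 2) priors on alpha and beta.
    log_prior = 5.0 * (log(alpha) + log(beta)) - 2.0 * (alpha + beta)
    log_lik = sum(log_beta(alpha + x, beta + n - x) for x in successes)
    return log_prior + log_lik - len(successes) * log_beta(alpha, beta)


def summarize(successes, n: int, cells: int = 120, upper: float = 12.0):
    """Posterior mean and sd of alpha / (alpha + beta) on a truncated grid."""
    step = upper / cells
    axis = [step * (j + 0.5) for j in range(cells)]
    logs = [(a / (a + b), log_kernel(a, b, successes, n)) for a in axis for b in axis]
    peak = max(lk for _, lk in logs)
    weights = [(p, exp(lk - peak)) for p, lk in logs]
    total = sum(w for _, w in weights)
    mean = sum(p * w for p, w in weights) / total
    sd = (sum((p - mean) ** 2 * w for p, w in weights) / total) ** 0.5
    return mean, sd


if __name__ == "__main__":
    designs = {
        "k=1,  n=16": ([12], 16),
        "k=4,  n=4": ([4, 3, 3, 2], 4),
        "k=16, n=1": ([1] * 12 + [0] * 4, 1),
    }
    print("prior:", summarize([], 1))
    for label, (x, n) in designs.items():
        print(label, summarize(x, n))
```

The binomial coefficients are omitted because they do not depend on $(\alpha, \beta)$ and cancel on normalization. Subtracting the peak before exponentiating keeps the weights in a safe numerical range.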

Using the computational approach described in Section 14.2 and approximating
$k$ and $n$ by real variables, we estimated the surface $\mathcal{I}(\mathcal{E}^{(n,k)})$. Figure 14.10 gives the
contours of the surface. Choices that have the same total number $kn$ of patients are
represented by dotted lines. The expected information increases if one increases $k$
and decreases $n$ so that $kn$ remains fixed.

If each patient, in information units, costs a fixed amount $c$, and if there is a fixed
“startup” cost $s$ in each center, the expected utility function is

$$
U_\pi(n, k) = \mathcal{I}(\mathcal{E}^{(n,k)}) - cnk - sk.
$$


For example, assume $s = 0.045$ and $c = 0.015$. A contour plot of the utility surface
is shown in Figure 14.11. This can be used to identify the optimum, in this case the
pair $(n = 11, k = 6)$.

More generally, the entire utility surface provides useful information for planning 
the study. In practice, planning complex studies will require informal consideration 
of a variety of constraints and factors that are difficult to quantify in a decision- 
theoretic fashion. Examining the utility surface, we can divide the decision space 
into regions with positive and negative utility. In this example, some choices of k 
and n have negative expected utility and are therefore worse than no sampling. In 
particular, all designs with only one center (the horizontal line at k = 1) are worse 
than no experimentation. We can also identify regions that are, say, within 10% of the 
utility of the maximum. This parallels the notion of credible intervals and provides 
information on robustness of optimal choices. 
Figure 14.7 Prior distributions. Panel (a), prior distribution on $\alpha$ and $\beta$. Panel (b),
prior predictive distribution of the success probability $\theta_{k+1}$ in center k + 1. Figure from
Parmigiani and Berry (1994).
Figure 14.8 Posterior distributions on $\alpha$ and $\beta$ under alternative designs; all are
based on 12 successes in 16 patients. Panels: k = 16, n = 1, x = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0);
k = 4, n = 4, x = (4, 3, 3, 2); k = 1, n = 16, x = 12. Figure from Parmigiani and Berry (1994).
Figure 14.9 Marginal distributions of the success probability $\theta_{k+1}$ under alternative
designs; all are based on 12 successes in 16 patients. Outcomes are as in Figure
14.8. Figure from Parmigiani and Berry (1994).



Figure 14.10 Contour plot of the estimated information surface as a function of 
the number of centers k and the number of patients per center n; dotted lines identify 
designs with the same number of patients, respectively 10, 25, and 50. Figure from 
Parmigiani and Berry (1994). 
Figure 14.11 Contour plot of the estimated utility surface as a function of the number
of centers k and the number of patients per center n. Pairs inside the 0-level curve
are better than no experimentation. Points outside are worse. Figure from Parmigiani
and Berry (1994).


14.4 Exercises 

Problem 14.1 A medical laboratory must test N samples of blood to see which have
traces of a rare disease. The probability that any individual from the relevant population
has the disease is a known $\theta$. Given $\theta$, individuals are independent. Because
$\theta$ is small it is suggested that the laboratory combine the blood of a individuals into
equal-sized pools, giving n = N/a pools, where a is a divisor of N. Each pool of samples
would then be tested to see if it exhibited a trace of the infection. If no trace
were found then the individual samples comprising the group would be known to be
uninfected. If on the other hand a trace were found in the pooled sample it is then
proposed that each of the a samples be tested individually. Discuss finding the optimal
a when N = 4, $\theta$ = 0.02, and then when N = 300. What can you say about the
case in which the disease frequency is an unknown $\theta$?


We work out the known $\theta$ case and leave the unknown $\theta$ for you. First we assume
that N = 4, $\theta$ = 0.02, and that it costs $1 to test a sample of blood, whether pooled or
unpooled. The possible actions are a = 1, a = 2, and a = 4. The possible experimental
outcomes for the pooled samples are:

(a) If a = 4 is chosen

$x_4 = 1$ Positive test result.
$x_4 = 0$ Negative test result.

(b) If a = 2 is chosen

$x_2 = 2$ Positive test result in both pooled samples.

$x_2 = 1$ Positive test result in exactly one of the pooled samples.

$x_2 = 0$ Negative test result in both pooled samples.

Recalling that N = 4, $\theta$ = 0.02, and that nonoverlapping groups of individuals
are also independent, we get

$$\begin{aligned}
f(x_4 = 1 \mid \theta) &= 1 - (1 - 0.02)^4 = 0.0776 \\
f(x_4 = 0 \mid \theta) &= (1 - 0.02)^4 = 0.9224 \\
f(x_2 = 2 \mid \theta) &= [1 - (1 - 0.02)^2]^2 = 0.0016 \\
f(x_2 = 0 \mid \theta) &= (1 - 0.02)^4 = 0.9224 \\
f(x_2 = 1 \mid \theta) &= 1 - f(x_2 = 2 \mid \theta) - f(x_2 = 0 \mid \theta) = 0.076.
\end{aligned}$$

Figure 14.12 presents a simple one-stage decision tree for this problem, solved by 
using the probabilities just obtained. The best decision is to take one pooled sample 
of size 4. 
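
The comparison behind this decision can be checked numerically. The following is a minimal sketch, assuming a unit cost of $1 per test as in the text; the function names `pool_outcome_probs` and `expected_cost` are illustrative and not part of the original solution.

```python
from math import comb

def pool_outcome_probs(a, N, theta):
    """Return {x: P(x pools test positive)} for pools of size a (a must divide N)."""
    n_pools = N // a
    p_pos = 1 - (1 - theta) ** a  # a pool is positive if any member is infected
    return {x: comb(n_pools, x) * p_pos ** x * (1 - p_pos) ** (n_pools - x)
            for x in range(n_pools + 1)}

def expected_cost(a, N, theta, unit_cost=1.0):
    """Expected testing cost: one test per pool plus a retests per positive pool."""
    if a == 1:
        return unit_cost * N  # everyone is tested individually, no pooling
    n_pools = N // a
    probs = pool_outcome_probs(a, N, theta)
    return sum(p * unit_cost * (n_pools + a * x) for x, p in probs.items())

costs = {a: expected_cost(a, N=4, theta=0.02) for a in (1, 2, 4)}
best_a = min(costs, key=costs.get)  # action with the smallest expected cost
```

Under these assumptions the dictionary `costs` holds the expected cost of each action, and `best_a` is the pool size with the smallest one.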


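Figure 14.12 Decision tree for the pooling example. Terminal payoffs: -$4 for a = 1;
-$5 (positive pool) and -$1 (negative pool) for a = 4; -$6, -$4, and -$2 for
$x_2$ = 2, 1, and 0 when a = 2.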
Now, let us assume that N = 300. We are going to graph the optimal a for $\theta$ equal to
0.0, 0.1, 0.2, 0.3, 0.4, or 0.5. The cost is still $1. For a = 1 the cost will be N dollars
for sure. For all a > 1 divisors of N let us define $n_a = N/a$ and $x_a$ the number of
subgroups that exhibited a trace of infection. The possible values of $x_a$
are $\{0, 1, \ldots, n_a\}$. Let us also define $\theta_a$ as the probability that a particular subgroup
exhibits a trace of infection, which is the same as the probability that at least one of
the a individuals in that subgroup exhibits a trace of infection. Then

$$\theta_a = 1 - (1 - \theta)^a \tag{14.40}$$

where, again, $\theta$ is the probability that a particular individual presents a trace of
infection. Finally,

$$x_a \sim \text{Binomial}(n_a, \theta_a). \tag{14.41}$$

If $x_a = 0$ then only one test is needed for each group and the cost is $C_a = n_a$.
If $x_a = 1$, then we need one test for each group, plus a tests, one for each individual in
the specific group that presented a trace of infection. Thus, defining $C_a$ as the cost
associated with decision a we have

$$C_a = n_a + a x_a. \tag{14.42}$$

Also, $C_a = n_a + a n_a$ if $x_a = n_a$. Hence,

$$\begin{aligned}
E[C_a \mid \theta] &= n_a + a E[x_a \mid \theta] \\
&= n_a + a n_a \theta_a \\
&= N/a + N[1 - (1 - \theta)^a]
\end{aligned}$$

or, factoring terms,

$$E[C_a \mid \theta] = N[1 + 1/a - (1 - \theta)^a], \quad a > 1. \tag{14.43}$$

Table 14.2 gives the values of $E[C_a \mid \theta]$ for all the divisors of N = 300 and for a
range of values of $\theta$. Figure 14.13 shows a versus $E[C_a \mid \theta]$ for the six different values
of $\theta$. If $\theta$ = 0.0 nobody is infected, so we do not need to do anything. If $\theta$ = 0.5 it
is better to apply the test for every individual instead of pooling them in subgroups.
The reason is that $\theta$ is high enough that many of the groups will be retested with high
probability. Table 14.3 shows the optimal a for each value of $\theta$.
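
The search over divisors can be sketched in a few lines. This is an illustration under the assumptions of the text (unit cost per test, a = 1 costing N for sure, and equation (14.43) for a > 1); the helper names are hypothetical.

```python
def divisors(N):
    """All positive divisors of N, in increasing order."""
    return [a for a in range(1, N + 1) if N % a == 0]

def expected_tests(a, N, theta):
    """E[C_a | theta]: N for a = 1, and equation (14.43) for a > 1."""
    if a == 1:
        return float(N)
    return N * (1 + 1 / a - (1 - theta) ** a)

def optimal_pool_size(N, theta):
    """Divisor of N with the smallest expected number of tests."""
    return min(divisors(N), key=lambda a: expected_tests(a, N, theta))

optimal = {theta: optimal_pool_size(300, theta)
           for theta in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

The case $\theta$ = 0 is left out on purpose, since the text notes that no observation is necessary there.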

Problem 14.2 Suppose that $x_1, x_2, \ldots$ are conditionally independent Bernoulli
experiments, all with success probability $\theta$. You are interested in estimating $\theta$ and
your terminal loss function is

$$L(\theta, a) = \frac{(\theta - a)^2}{\theta(1 - \theta)}.$$
Table 14.2 Computation of $E[C_a \mid \theta]$ for different values of $\theta$. The cost associated
with optimal solutions is marked with an asterisk. If $\theta$ = 0 no observation is necessary.

                         $\theta$
    a     0.0     0.1     0.2     0.3     0.4     0.5
    1     300     300     300     300     300*    300*
    2     150     207     258     303     342     375
    3     100     181     246*    297*    335     362
    4      75     178*    252     302     336     356
    5      60     182     261     309     336     350
    6      50     190     271     314     336     345
   10      30     225     297     321     328     329
   12      25     240     304     320     324     324
   15      20     258     309     318     319     319
   20      15     278     311     314     314     314
   25      12     290     310     311     311     312
   30      10     297     309     309     ...     310
   50       6     304     305     306     306     306
   60       5     304     304     305     305     305
   75       4     303     304     304     304     304
  100       3     302     303     303     303     303
  150       2     302     302     302     302     302
  300       1     301     301     301     301     301



Figure 14.13 Computation of $E[C_a \mid \theta]$ for different values of $\theta$.
Table 14.3 Optimal decisions for different values of $\theta$.

  $\theta$       0.1     0.2     0.3     0.4     0.5
  Optimal a       4       3       3       1       1


This is the traditional squared error loss, standardized using the variance of a single
observation. With this loss function, errors of the same absolute magnitude are
considered worse if they are made near the extremes of the interval (0, 1). You have a
Beta(1.5, 1.5) prior on $\theta$. Each observation costs you c. What is the optimal fixed
sample size? You may start using computers at any stage in the solution of this
problem.

Problem 14.3 You are the Mayor of Solvantville and you are interested in estimating
the difference in the concentration of a certain contaminant in the water
supply of homes in two different sections of town, each with its own well. If you
found that there is a difference you would further investigate whether there may be
contamination due to industrial waste. How big a sample of homes do you need
for your purposes? Specify a terminal loss function, a cost of observation, a likelihood,
and a prior. Keep it simple, but justify your assumptions and point out
limitations.

Problem 14.4 In the context of Section 14.3.2 determine the optimal n by
minimizing $r(\pi, n)$.

Problem 14.5 Suppose x has a normal distribution with mean $\theta$ and variance $\sigma^2$.
Let $f(\cdot)$ denote its density function.

(a) Prove that $\int_{-\infty}^{u} x f(x)\,dx = \theta\Phi(z) - \sigma\phi(z)$, where $z = (u - \theta)/\sigma$, and $\phi$ and $\Phi$
denote, respectively, the density and the cumulative distribution function of
the standard normal distribution.

(b) Define $\zeta(y) = \phi(y) - y(1 - \Phi(y))$. Prove that $\zeta(-y) = y + \zeta(y)$.

(c) Let $A(u) = \int_{u}^{\infty} (x - u) f(x)\,dx$ and $B(u) = \int_{-\infty}^{u} (u - x) f(x)\,dx$ be, respectively,
the right and left hand side linear loss integrals. Let $z = (u - \theta)/\sigma$. Prove
that

(i) $A(u) = \sigma\zeta(z)$.

(ii) $B(u) = \sigma\zeta(-z)$.

Problem 14.6 In the context of the example of Section 14.3.3:

(a) Derive the function g in equation (14.35) and prove that its first- and second-order
derivatives with respect to r are as given in (14.36).

(b) Prove that when $z_0 > 0$, a local maximum may or may not exist (that is, in
the latter case, the optimal sample size is zero).

Problem 14.7 Can you find the optimal sample size in Example 14.3.1 when $\sigma$
is not known? You can use a conjugate prior and standard results such as those of
Sec. 3.3 in Gelman et al. (1995).



15 


Stopping 


In Chapter 14 we studied how to determine an optimal fixed sample size. In that 
discussion, the decision maker selects a sample size before seeing any of the data. 
Once the data are observed, the decision maker then makes a terminal decision, typically
an inference. Here, we will consider the case in which the decision maker
can make observations one at a time. At each time, the decision maker can either 
stop sampling (and choose a terminal decision), or continue sampling. The general 
technique for solving this sequential problem is dynamic programming, presented in 
general terms in Chapter 12. 

In Section 15.1, we start with a little history of the origin of sequential sampling
methods. Sequential designs were originally investigated because they can
potentially reduce the expected sample size (Wald 1945). Our revisitation of Wald’s
sequential stopping theory is almost entirely Bayesian and can be traced back to
Arrow et al. (1949). In their paper, they address some limitations of Wald’s proof
of the optimality of the sequential probability ratio test, derive a Bayesian solution,
and also introduce a novel backward induction method that was the basis for
dynamic programming. We illustrate the gains that can be achieved with sequential
sampling using a simple example in Section 15.2. Next, we formalize the sequential
framework for the optimal choice of the sample size in Sections 15.3.1, 15.3.2,
and 15.3.3. In Section 15.4, we present an example of optimal stopping using the
dynamic programming technique.

In Section 15.5, we discuss sequential sampling rules that do not require specifying
the costs of experimentation, but rather continue experimenting until enough
information about the parameters of interest is gained. An example is to sample
until the posterior probability of a hypothesis of interest reaches a prespecified
level. We examine the question of whether such a procedure could be used to
design studies that always reach a foregone conclusion, and conclude that from a
Bayesian standpoint, that is not the case. Finally, in Section 15.6 we discuss the
role of stopping rules in the terminal decision, typically inferences, and connect our
decision-theoretic approach to the Likelihood Principle.

Featured articles: 

Wald, A. (1945). Sequential tests of statistical hypotheses, Annals of Mathematical 
Statistics 16: 117-186. 

Arrow, K., Blackwell, D. and Girshick, M. (1949). Bayes and minimax solutions of 
sequential decision problems, Econometrica 17: 213-244. 

Useful reference texts for this chapter are DeGroot (1970), Berger (1985), and 
Bernardo and Smith (2000). 


15.1 Historical note 

The earliest example of a sequential method in the statistical literature is, as far as we 
know, provided by Dodge and Romig (1929) who proposed a two-stage procedure 
in which the decision of whether or not to draw a second sample was based on the 
observations of the first sample. It was only years later that the idea of sequential 
analysis would be more broadly discussed. In 1943, US Navy Captain G. L. Schuyler 
approached the Statistical Research Group at Columbia University with the problem 
of determining the sample size needed for comparing two proportions. The sample 
size required was very large. In a letter addressed to Warren Weaver, W. Allen Wallis 
described Schuyler’s impressions: 

When I presented this result to Schuyler, he was impressed by the largeness
of the samples required for the degree of precision and certainty
that seemed to him desirable in ordnance testing. Some of these samples
ran to many thousands of rounds. He said that when such a test program
is set up at Dahlgren it may prove wasteful. If a wise and seasoned
ordnance expert like Schuyler were on the premises, he would see after
the first few thousand or even few hundred [rounds] that the experiment
need not be completed, either because the new method is obviously inferior
or because it is obviously superior beyond what was hoped for. He
said that you cannot give any leeway to Dahlgren personnel, whom he
seemed to think often lack judgement and experience, but he thought it
would be nice if there were some mechanical rule which could be specified
in advance stating the conditions under which the experiment might
be terminated earlier than planned. (Wallis 1980, p. 325)

W. Allen Wallis and Milton Friedman explored this idea and came up with the 
following conjecture: 

Suppose that N is the planned number of trials and $W_N$ is a most powerful
critical region based on N observations. If it happens that on the basis of
the first n trials (n < N) it is already certain that the completed set of
N must lead to a rejection of the null hypothesis, we can terminate the
experiment at the nth trial and thus save some observations. For instance,
if $W_N$ is defined by the inequality $x_1^2 + \ldots + x_N^2 \geq c$, and if for some
n < N we find that $x_1^2 + \ldots + x_n^2 \geq c$, we can terminate the process
at this stage. Realization of this naturally led Friedman and Wallis to
the conjecture that modifications of current tests may exist which take
advantage of sequential procedure and effect substantial improvements.

More specifically, Friedman and Wallis conjectured that a sequential test
may exist that controls the errors of the first and second kinds to exactly
the same extent as the current most powerful test, and at the same time
requires an expected number of observations substantially smaller than
the number of observations required by the current most powerful test.
(Wald 1945, pp. 120-121)

They first approached Wolfowitz with this idea. Wolfowitz, however, did not 
show interest and was doubtful about the existence of a sequential procedure that 
would improve over the most powerful test. Next, Wallis and Friedman brought this 
problem to the attention of A. Wald who studied it and, in April of 1943, proposed 
the sequential probability ratio test. In the problem of testing a simple null versus a 
simple alternative Wald claimed that: 

The sequential probability ratio test frequently results in a saving of 
about 50% in the number of observations as compared with the current 
most powerful test. (Wald 1945, p. 119) 

Wald’s finding was of immediate interest, as he explained: 

Because of the substantial savings in the expected number of observations
effected by the sequential probability ratio test, and because
of the simplicity of this test procedure in practical applications, the 
National Defense Research Committee considered these developments 
sufficiently useful for the war effort to make it desirable to keep the 
results out of the reach of the enemy, at least for a certain period of time. 

The author was, therefore, requested to submit his findings in a restricted 
report. (Wald 1945, p. 121) 

It was only in 1945, after the reports of the Statistical Research Group were no longer 
classified, that Wald’s research was published. Wald followed his paper with a book 
published in 1947 (Wald 1947b). 

In Chapter 7 we discussed Wald’s contribution to statistical decision theory. In 
our discussion, however, we only considered the situation in which the decision 
maker has a fixed number of observations. Extensions of the theory to sequential 
decision making are in Wald (1947a) and Wald (1950). Wald explores minimax and 
Bayes rules in the sequential setting. The paper by Arrow et al. (1949) is, however,
the first example we know of utilizing backwards induction to derive the formal
Bayesian optimal sequential procedure.


15.2 A motivating example 

This example, based on DeGroot (1970), illustrates the insight of Wallis and Friedman,
and shows how a sequential procedure improves over the fixed sample size
design.

Suppose it is desired to test the null hypothesis $H_0: \theta = \theta_0$ versus the alternative
$H_1: \theta = \theta_1$. Let $a_i$ denote the decision of accepting hypothesis $H_i$, i = 0, 1. Assume
that the utility function is

$$u(a_i(\theta)) = -I_{\{\theta \neq \theta_i\}}(\theta). \tag{15.1}$$

The decision maker can take observations $x_i$, i = 1, 2, .... Each observation costs
c units and it has a probability $\alpha$ of providing an uninformative outcome, while in
the remainder of the cases it provides an answer that reveals the value of $\theta$ without
ambiguity. Formally $x_i$ has the following probability distribution:

$$f(x \mid \theta) = \begin{cases}
1 - \alpha, & \text{if } \theta = \theta_0 \text{ and } x = 1, \text{ or } \theta = \theta_1 \text{ and } x = 2 \\
0, & \text{if } \theta = \theta_0 \text{ and } x = 2, \text{ or } \theta = \theta_1 \text{ and } x = 1 \\
\alpha, & \text{if } x = 3.
\end{cases} \tag{15.2}$$


Let $\pi = \pi(\theta = \theta_0)$ denote the prior probability of the hypothesis $H_0$, and assume that
$\pi \leq 1/2$. If no observations are made, the Bayes decision is $a_1$ and the associated
expected utility is $-\pi$. Now, let us study the case in which observations are available.
Let $y_x$ count the number of observations with value x, with x = 1, 2, 3. Given a
sequence of observations $x^n = (x_1, \ldots, x_n)$, the posterior distribution is

$$\pi_{x^n} = \pi(\theta = \theta_0 \mid x^n) = \frac{\pi f(x^n \mid \theta_0)}{m(x^n)} = \begin{cases}
1, & \text{if } y_2 = 0 \text{ and } y_3 < n \\
0, & \text{if } y_2 > 0 \\
\pi, & \text{if } y_3 = n.
\end{cases} \tag{15.3}$$

There is no learning about $\theta$ if all the observed x are equal to 3, the uninformative
case. However, in any other case, the value of $\theta$ is known with certainty. It is futile to
continue experimenting after x = 1 or x = 2 is observed. For sequences $x^n$ that reveal
the value of $\theta$ with certainty, the expected posterior utility, not including sampling
costs, is 0. When $y_3 = n$, which happens with probability $\alpha^n$, the expected posterior
utility is $-\pi$. Thus, the expected utility of taking n observations, accounting for
sampling costs, is

$$\mathcal{U}_\pi(n) = -\pi\alpha^n - cn. \tag{15.4}$$

Suppose that the values of c and $\alpha$ are such that $\mathcal{U}_\pi(1) > \mathcal{U}_\pi(0)$, that is it is
worthwhile to take at least one observation, and let n* denote the positive integer
that maximizes equation (15.4). An approximate value is obtained by assuming
continuity in n. Differentiating with respect to n, and setting the result to 0, gives

$$n^* = \left[\log\frac{1}{\alpha}\right]^{-1} \log\left[\frac{\pi\log(1/\alpha)}{c}\right]. \tag{15.5}$$
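
For readers who want the intermediate step, the differentiation leading to (15.5) can be written out as follows. This is a sketch that treats n as continuous and takes log to be the natural logarithm, which is an assumption since the text does not state the base.

$$
\frac{d}{dn}\left(-\pi\alpha^{n} - cn\right) = \pi\alpha^{n}\log(1/\alpha) - c = 0
\;\Longrightarrow\;
\alpha^{n} = \frac{c}{\pi\log(1/\alpha)}
\;\Longrightarrow\;
n = \frac{\log\left[\pi\log(1/\alpha)/c\right]}{\log(1/\alpha)}.
$$

The second derivative, $-\pi\alpha^{n}[\log(1/\alpha)]^2$, is negative, so the stationary point is a maximum of the continuous relaxation.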


The decision maker can reduce costs of experimentation by taking the n* observations
sequentially with the possibility of stopping as soon as he or she observes a
value different from 3. Using this procedure, the sample size N is a random variable.
Its expected value is

$$\begin{aligned}
E[N \mid \theta_0] &= \sum_{j=1}^{n^*} j\,P(N = j \mid \theta = \theta_0) \\
&= \sum_{j=1}^{n^*-1} j\alpha^{j-1}(1 - \alpha) + n^*\alpha^{n^*-1} = \frac{1 - \alpha^{n^*}}{1 - \alpha}.
\end{aligned} \tag{15.6}$$

This is independent of $\theta$, so $E[N \mid \theta_1] = E[N \mid \theta_0]$. Thus, $E[N] = (1 - \alpha^{n^*})/(1 - \alpha)$.

This sequential sampling scheme imposes an upper bound of n* observations on
N, and hence on its expected value. To illustrate the reduction in the number of observations
when using sequential sampling, let $\alpha$ = 1/4, c = 0.000001, $\pi$ = 1/4. We obtain
n* = 10 and E[N] = 1.33; that is, on average, we have a great reduction in the sample
size needed.
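
The quantities in (15.4) to (15.6) are easy to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, log is taken as the natural logarithm, and the values $\alpha$ = 1/4 and n* = 10 used at the end are those quoted in the text.

```python
from math import log

def expected_utility(n, prior, alpha, c):
    """U_pi(n) = -pi * alpha**n - c * n, equation (15.4)."""
    return -prior * alpha ** n - c * n

def continuous_n_star(prior, alpha, c):
    """Stationary point of (15.4) when n is treated as continuous, equation (15.5)."""
    return log(prior * log(1 / alpha) / c) / log(1 / alpha)

def expected_stopping_time(n_star, alpha):
    """E[N] from (15.6), summed term by term rather than via the closed form."""
    head = sum(j * alpha ** (j - 1) * (1 - alpha) for j in range(1, n_star))
    return head + n_star * alpha ** (n_star - 1)

e_n = expected_stopping_time(10, 0.25)
closed_form = (1 - 0.25 ** 10) / (1 - 0.25)  # right-hand side of (15.6)
```

Comparing `e_n` with `closed_form` is a quick check that the sum in (15.6) collapses to the stated ratio.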

Figure 15.1 illustrates this further using a simulation. We chose values of $\theta$ from
its prior distribution with $\pi$ = 1/2. Given $\theta$, we sequentially simulated observations
using (15.2) with $\alpha$ = 1/4 until we obtained a value different from 3, or reached
the maximum sample size n*, and recorded the required sample size. We repeated
this procedure 100 times. In the figure we show the resulting number of observations
with the sequential procedure.

Figure 15.1 Sample size for 100 simulated sequential experiments in which
$\pi$ = 1/2 and $\alpha$ = 1/4.
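
A simulation of this kind can be sketched as follows. It is not the authors' code: the seed, the function name, and the cap of 10 observations (the n* quoted earlier) are assumptions. Because the probability of an uninformative outcome is $\alpha$ under either hypothesis, the sketch does not need to draw $\theta$ to simulate the sample size.

```python
import random

def simulate_stopping_time(alpha, n_max, rng):
    """Observe until an informative outcome (x != 3) appears or n_max is reached."""
    for n in range(1, n_max + 1):
        if rng.random() >= alpha:  # informative outcome, probability 1 - alpha
            return n
    return n_max  # every observation was uninformative

rng = random.Random(0)  # fixed seed so the run is reproducible
sizes = [simulate_stopping_time(0.25, 10, rng) for _ in range(100)]
mean_size = sum(sizes) / len(sizes)
```

With more replications the sample mean of `sizes` should settle near the value of E[N] given by (15.6).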

15.3 Bayesian optimal stopping 

15.3.1 Notation 

As in Chapter 14, suppose that observations $x_1, x_2, \ldots$ become available sequentially
one at a time. After taking n observations, that is observing $x^n = (x_1, \ldots, x_n)$, the
joint probability density function (or joint probability mass function depending on
the application) is $f(x^n \mid \theta)$. As usual, assuming that $\pi$ is the prior distribution for $\theta$,
after n observations, the posterior density is $\pi(\theta \mid x^n)$ and the marginal distribution of
$x^n$ is

$$m(x^n) = \int_\Theta f(x^n \mid \theta)\,\pi(\theta)\,d\theta.$$

The terminal utility function associated with a decision a when the state of nature
is $\theta$ is $u(a(\theta))$. The experimental cost of taking n observations is denoted by C(n).
The overall utility is, therefore,

$$u(\theta, a, n) = u(a(\theta)) - C(n).$$


The terminal decision rule after n observations (that is, after observing $x^n$) is denoted
by $\delta_n(x^n)$. As in the nonsequential case, for every possible experimental outcome $x^n$,
$\delta_n(x^n)$ tells us which action to choose if we stop sampling after $x^n$. In the sequential
case, though, a decision rule requires the full sequence $\delta = \{\delta_1(x^1), \delta_2(x^2), \ldots\}$ of
terminal decision rules. Specifying $\delta$ tells us what to do if we stop sampling after
1, 2, ... observations have been made. But when should we stop sampling?

Let $\zeta_n(x^n)$ be the decision rule describing whether or not we stop sampling after
n observations, that is

$$\zeta_n(x^n) = \begin{cases}
1, & \text{stop after observing } x^n \\
0, & \text{continue sampling.}
\end{cases}$$

If $\zeta_0 = 1$ there is no experimentation. We can now define stopping rule, stopping
time, and sequential decision rule.

Definition 15.1 The sequence $\zeta = \{\zeta_0, \zeta_1(x^1), \zeta_2(x^2), \ldots\}$ is called a stopping rule.

Definition 15.2 The number N of observations after which the decision maker
decides to stop sampling and make the final decision is called stopping time.

The stopping time is a random variable whose distribution depends on $\zeta$. Specifically,

$$N = \min\{n \geq 0 \text{ such that } \zeta_n(x^n) = 1\}.$$

We will also use the notation $\{N = n\}$ for the set of observations $(x_1, x_2, \ldots)$ such
that the sampling stops at n, that is for which

$$\zeta_0 = 0,\ \zeta_1(x^1) = 0,\ \ldots,\ \zeta_{n-1}(x^{n-1}) = 0,\ \zeta_n(x^n) = 1.$$

Definition 15.3 A sequential decision rule d consists of a stopping rule and a decision
rule, that is $d = (\zeta, \delta)$. If $P_\theta(N < \infty) = 1$ for all values of $\theta$, then we say that
the sequential stopping rule $\zeta$ is proper.

From now on, in our discussion we will consider just the class of proper stopping
rules.
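
The definitions of stopping rule and stopping time translate directly into code. In this minimal sketch a stopping rule is represented as a function of the observed prefix that returns 1 (stop) or 0 (continue); the example rule `stop_after_two_successes` is hypothetical and only serves to exercise the definition.

```python
from typing import Callable, Optional, Sequence

# A stopping rule maps the observed prefix x^n to 1 (stop) or 0 (continue).
StoppingRule = Callable[[Sequence[int]], int]

def stopping_time(rule: StoppingRule, xs: Sequence[int]) -> Optional[int]:
    """Smallest n >= 0 with rule(x^n) == 1, or None if the rule never stops on xs."""
    for n in range(len(xs) + 1):
        if rule(xs[:n]) == 1:
            return n
    return None

def stop_after_two_successes(x_n: Sequence[int]) -> int:
    # Hypothetical rule: stop once two successes have been observed.
    return 1 if sum(x_n) >= 2 else 0
```

A rule that returns 1 on the empty prefix corresponds to no experimentation, so its stopping time is 0.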

15.3.2 Bayes sequential procedure 

In this section we extend the definition of Bayes rule to the sequential framework.
To do so, we first note that

$$P_\theta(N < \infty) = \sum_{n=0}^{\infty} P_\theta(N = n) = P_\theta(N = 0) + \sum_{n=1}^{\infty} \int_{\{N=n\}} f(x^n \mid \theta)\,dx^n.$$


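The counterpart of the frequentist risk R in this context is the long-term average
utility of the sequential procedure d for a given $\theta$, that is

$$\begin{aligned}
\mathcal{U}(\theta, d) &= u(\theta, \delta_0, 0)\,P_\theta(N = 0) + \sum_{n=1}^{\infty} \int_{\{N=n\}} u(\theta, \delta_n(x^n), n)\,f(x^n \mid \theta)\,dx^n \\
&= u(\delta_0(\theta))\,P_\theta(N = 0) + \sum_{n=1}^{\infty} \mathcal{U}(\theta, \delta_n) - \sum_{n=1}^{\infty} C(n)\,P_\theta(N = n),
\end{aligned}$$

where

$$\mathcal{U}(\theta, \delta_n) = \int_{\{N=n\}} u(\delta_n(x^n)(\theta))\,f(x^n \mid \theta)\,dx^n,$$

with $\delta_n(x^n)(\theta)$ denoting that for each $x^n$ the decision rule $\delta_n(x^n)$ produces an action,
which in turn is a function from states to outcomes.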

The counterpart of the Bayes risk is the expected utility of the sequential decision
procedure d, given by

$$\mathcal{U}_\pi(d) = \int_\Theta \mathcal{U}(\theta, d)\,\pi(\theta)\,d\theta.$$

Definition 15.4 The Bayes sequential procedure $d^\pi$ is the procedure which
maximizes the expected utility $\mathcal{U}_\pi(d)$.





Theorem 15.1 If $\delta^*_n(x^n)$ is a Bayes rule for the fixed sample size problem
with observations $x_1, \ldots, x_n$ and utility $u(\theta, a, n)$, then $\delta^* = (\delta^*_1, \delta^*_2, \ldots)$ is a Bayes
sequential decision rule.


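Proof: Without loss of generality, we assume that $P_\theta(N = 0) = 0$:

$$\begin{aligned}
\mathcal{U}_\pi(d) &= \int_\Theta \mathcal{U}(\theta, d)\,\pi(\theta)\,d\theta \\
&= \sum_{n=1}^{\infty} \int_\Theta \int_{\{N=n\}} u(\theta, \delta_n(x^n), n)\,f(x^n \mid \theta)\,\pi(\theta)\,dx^n\,d\theta.
\end{aligned}$$

If we interchange the integrals on the right hand side of the above expression we
obtain

$$\mathcal{U}_\pi(d) = \sum_{n=1}^{\infty} \int_{\{N=n\}} \int_\Theta u(\theta, \delta_n(x^n), n)\,f(x^n \mid \theta)\,\pi(\theta)\,d\theta\,dx^n.$$

Note that $f(x^n \mid \theta)\pi(\theta) = \pi(\theta \mid x^n)m(x^n)$. Using this fact, let

$$\mathcal{U}_{\pi_{x^n}, N=n}(\delta_n(x^n)) = \int_\Theta u(\theta, \delta_n(x^n), n)\,\pi(\theta \mid x^n)\,d\theta$$

denote the posterior expected utility at stage n. Thus,

$$\mathcal{U}_\pi(d) = \sum_{n=1}^{\infty} \int_{\{N=n\}} \mathcal{U}_{\pi_{x^n}, N=n}(\delta_n(x^n))\,m(x^n)\,dx^n.$$

The maximum of $\mathcal{U}_\pi(d)$ is attained if, for each $x^n$, $\delta_n(x^n)$ is chosen to maximize
the posterior expected utility $\mathcal{U}_{\pi_{x^n}, N=n}(\delta_n(x^n))$, which is done in the fixed sample size
problem, proving the theorem. □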


This theorem tells us that, regardless of the stopping rule used, once the experiment
is stopped, the optimal decision is the formal Bayes rule conditional on the
observations collected. This result echoes closely our discussion of normal and 
extensive forms in Section 12.3.1. We will return to the implications of this result 
in Section 15.6. 


15.3.3 Bayes truncated procedure 

Next we discuss how to determine the optimal stopping time. We will consider
bounded sequential decision procedures d (DeGroot 1970) for which there is a positive
number k such that $P(N \leq k) = 1$. The technique for solving for the optimal stopping
rule is backwards induction as we previously discussed in Chapter 12.

Say we carried out n steps and observed $x^n$. If we continue we face a new sequential
decision problem with the same terminal decision as before, the possibility of
observing $x_{n+1}, x_{n+2}, \ldots$, and utility function $u(a(\theta)) - C(j)$, where j is the number of
additional observations. Now, though, our updated probability distribution on $\theta$, that
is $\pi_{x^n}$, will serve as the prior. Let $\mathcal{D}^n$ be the class of all proper sequential decision
rules for this problem. The expected utility of a rule d in $\mathcal{D}^n$ is $\mathcal{U}_{\pi_{x^n}}(d)$, while the
maximum achievable expected utility if we continue is

$$W(\pi_{x^n}, n) = \sup_{d \in \mathcal{D}^n} \mathcal{U}_{\pi_{x^n}}(d).$$

To decide whether or not to stop we compare the expected utility of the continuation
problem to the expected utility of stopping immediately. At time n, the posterior
expected utility of collecting an additional 0 observations is

$$W_0(\pi_{x^n}, n) = \sup_a \int_\Theta [u(a(\theta)) - C(n)]\,\pi_{x^n}(\theta)\,d\theta.$$

Therefore, we would continue sampling if $W(\pi_{x^n}, n) > W_0(\pi_{x^n}, n)$.

A catch here is that computing $W(\pi_{x^n}, n)$ may not be easy. For example, dynamic
programming tells us we should begin at the end, but here there is no end. An easy
fix is to truncate the problem. Unless you plan to be immortal, you can do this with
little loss of generality. So consider now the situation in which the decision maker
can take no more than k overall observations. Formally, we will consider a subset $\mathcal{D}^n_k$
of $\mathcal{D}^n$, consisting of rules that take a total of at most k observations. These rules are all such
that $\zeta_k(x^k) = 1$. Procedures $d^k$ with this feature are called k-truncated procedures.
We can define the k-truncated value of the experiment starting at stage n as

$$W_{k-n}(\pi_{x^n}, n) = \sup_{d \in \mathcal{D}^n_k} \mathcal{U}_{\pi_{x^n}}(d).$$

Given that we made it to n and saw $x^n$, the best we could possibly do if we continue
for at most $k - n$ additional steps is $W_{k-n}(\pi_{x^n}, n)$. Because $\mathcal{D}^n_k$ is a subset of
$\mathcal{D}^n_{k+1}$, and both are subsets of $\mathcal{D}^n$, the value of the k-truncated problem is
non-decreasing in k.

The Bayes k-truncated procedure $d^k$ is a procedure that lies in $\mathcal{D}^0_k$ and for which

$$\mathcal{U}_\pi(d^k) = W_k(\pi, 0).$$

Also, you can think of $W(\pi_{x^n}, n)$ as $W_\infty(\pi_{x^n}, n)$.

The relevant expected utilities for the stopping problem can be calculated by
induction. We know the value is $W_0(\pi_{x^n}, n)$ if we stop immediately. Then we can
recursively evaluate the value of stopping at any intermediate stage between n and
k by

$$\begin{aligned}
W_1(\pi_{x^n}, n) &= \max\{W_0(\pi_{x^n}, n),\ E_{x_{n+1} \mid x^n}[W_0(\pi_{x^{n+1}}, n + 1)]\} \\
W_2(\pi_{x^n}, n) &= \max\{W_0(\pi_{x^n}, n),\ E_{x_{n+1} \mid x^n}[W_1(\pi_{x^{n+1}}, n + 1)]\} \\
&\ \ \vdots \\
W_{k-n}(\pi_{x^n}, n) &= \max\{W_0(\pi_{x^n}, n),\ E_{x_{n+1} \mid x^n}[W_{k-n-1}(\pi_{x^{n+1}}, n + 1)]\}.
\end{aligned}$$

Based on this recursion we can work out the following result.
Theorem 15.2 Assume that the Bayes rule $\delta^*_n$ exists for all n, and $W_j(\pi_{x^n}, n)$ are
finite for $j \leq k$, $n \leq (k - j)$. The Bayes k-truncated procedure is given by $d^k = (\zeta^*, \delta^*)$
where $\zeta^*$ is the stopping rule which stops sampling at the first n such that

$$W_0(\pi_{x^n}, n) = W_{k-n}(\pi_{x^n}, n).$$

This theorem tells us that at the initial stage we compare $W_0(\pi, 0)$, associated
with an immediate Bayes decision, to the overall k-truncated $W_k(\pi, 0)$. We continue
sampling (observe $x_1$) only if $W_0(\pi, 0) < W_k(\pi, 0)$. Then, if this is the case, after $x_1$
has been observed, we compare $W_0(\pi_{x^1}, 1)$ of an immediate decision to the (k - 1)-truncated
problem $W_{k-1}(\pi_{x^1}, 1)$ at stage 1. As before, we continue sampling only if
$W_0(\pi_{x^1}, 1) < W_{k-1}(\pi_{x^1}, 1)$. We proceed until $W_0(\pi_{x^n}, n) = W_{k-n}(\pi_{x^n}, n)$.

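In particular, if the cost of each observation is constant and equal to c, we have
C(n) = nc. Therefore,

$$\begin{aligned}
W_0(\pi_{x^n}, n) &= \sup_a \int_\Theta [u(a(\theta)) - nc]\,\pi_{x^n}(\theta)\,d\theta \\
&= \sup_a \mathcal{U}_{\pi_{x^n}}(a) - nc \\
&= V_0(\pi_{x^n}) - nc.
\end{aligned}$$

It follows that

$$W_1(\pi_{x^n}, n) = \max\{V_0(\pi_{x^n}),\ E_{x_{n+1} \mid x^n}[V_0(\pi_{x^{n+1}})] - c\} - nc.$$

If we define, inductively,

$$V_j(\pi_{x^n}) = \max\{V_0(\pi_{x^n}),\ E_{x_{n+1} \mid x^n}[V_{j-1}(\pi_{x^{n+1}})] - c\}$$

then $W_j(\pi_{x^n}, n) = V_j(\pi_{x^n}) - nc$.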

Corollary 2 If the cost of each observation is constant and equal to c, the Bayes
k-truncated stopping rule, $\zeta^k$, is to stop sampling and make a decision for the first
$n \leq k$ for which

$$V_0(\pi_{x^n}) = V_{k-n}(\pi_{x^n}).$$


We move next to applications. 


15.4 Examples 

15.4.1 Hypotheses testing 

This is an example of optimal stopping in a simple hypothesis testing setting. It is
discussed in Berger (1985) and DeGroot (1970). Another example in the context
of estimation is Problem 15.1. Assume that $x_1, x_2, \ldots$ is a sequential sample from a
Bernoulli distribution with unknown probability of success $\theta$. We wish to test $H_0$:
$\theta = 1/3$ versus $H_1$: $\theta = 2/3$. Let $a_i$ denote accepting $H_i$. Assume $u(\theta, a, n) =
u(a(\theta)) - nc$, with sampling cost c = 1 and with “0-20” terminal utility: that is, no
loss for a correct decision, and a loss of 20 for an incorrect decision, as in Table 15.1.
Finally, let $\pi_0$ denote the prior probability that $H_0$ is true, that is $\Pr(\theta = 1/3) = \pi_0$.
We want to find $d^2$, the optimal procedure among all those taking at most two
observations.

Table 15.1 Utility function for the sequential
hypothesis testing problem.

                 True value of $\theta$
            $\theta = 1/3$    $\theta = 2/3$
  $a_0$           0               -20
  $a_1$         -20                 0

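We begin with some useful results. Let $y_n = \sum_{i=1}^{n} x_i$.

(i) The marginal distribution of $x^n$ is

$$\begin{aligned}
m(x^n) &= f(x^n \mid \theta = 1/3)\,\pi_0 + f(x^n \mid \theta = 2/3)\,(1 - \pi_0) \\
&= (1/3)^{y_n}(2/3)^{n - y_n}\pi_0 + (2/3)^{y_n}(1/3)^{n - y_n}(1 - \pi_0) \\
&= (1/3)^n\left[2^{n - y_n}\pi_0 + 2^{y_n}(1 - \pi_0)\right].
\end{aligned}$$

(ii) The posterior distribution of $\theta$ given $x^n$ is

$$\begin{aligned}
\pi(\theta = 1/3 \mid x^n) &= \frac{\pi_0}{\pi_0 + 2^{2y_n - n}(1 - \pi_0)} \\
\pi(\theta = 2/3 \mid x^n) &= 1 - \pi(\theta = 1/3 \mid x^n).
\end{aligned}$$

(iii) The predictive distribution of $x_{n+1}$ given $x^n$ is

$$m(x_{n+1} \mid x^n) = \frac{m(x^{n+1})}{m(x^n)} = \begin{cases}
\dfrac{1}{3}\,\dfrac{2^{n+1-y_n}\pi_0 + 2^{y_n}(1 - \pi_0)}{2^{n-y_n}\pi_0 + 2^{y_n}(1 - \pi_0)} & \text{if } x_{n+1} = 0 \\[2ex]
\dfrac{1}{3}\,\dfrac{2^{n-y_n}\pi_0 + 2^{y_n+1}(1 - \pi_0)}{2^{n-y_n}\pi_0 + 2^{y_n}(1 - \pi_0)} & \text{if } x_{n+1} = 1.
\end{cases}$$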
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Before making any observation, the expected utility of immediately making a 
decision is 


W 0 (n 0 , 0) = VoCtto) = max 

a 


Y, u{a(6))jx{0) 
. e 


= max{—20(1 — n 0 ), —207ro} 


| —20^o if 0 < 7T 0 < 1/2 

{ —20(1 — 7T 0 ) if 1/2 < 7T 0 < 1. 


Therefore, the Bayes decision is a 0 if 1/2 < jt 0 < 1 and a, if 0 < n a < 1/2. 
Figure 15.2 shows the V-shaped utility function V<, highlighting that we expect to be 
better off at the extremes than at the center of the range of jt 0 in addressing to choice 
of hypothesis. 

Next, we calculate $V_0(\pi_{x^1})$, the value of taking an observation and then stopping.
Suppose that $x_1$ has been observed. If $x_1 = 0$, the posterior expected utility is

$$
V_0(\pi(\theta|x_1 = 0)) =
\begin{cases}
-\dfrac{40\pi_0}{1 + \pi_0} & \text{if } 0 \le \pi_0 \le 1/3 \\[2ex]
-\dfrac{20(1 - \pi_0)}{1 + \pi_0} & \text{if } 1/3 \le \pi_0 \le 1.
\end{cases}
$$

Figure 15.2 Expected utility functions $V_0$ (solid), $V_1$ (dashed), and $V_2$ (dot-dashed)
as functions of the prior probability $\pi_0$. Adapted from DeGroot (1970).

If $x_1 = 1$ the posterior expected utility is

$$
V_0(\pi(\theta|x_1 = 1)) =
\begin{cases}
-\dfrac{20\pi_0}{2 - \pi_0} & \text{if } 0 \le \pi_0 \le 2/3 \\[2ex]
-\dfrac{40(1 - \pi_0)}{2 - \pi_0} & \text{if } 2/3 \le \pi_0 \le 1.
\end{cases}
$$

Thus, the expected posterior utility is

$$
\begin{aligned}
E_{x_1}[V_0(\pi_{x^1})] &= V_0(\pi(\theta|x_1 = 0))\,m(0) + V_0(\pi(\theta|x_1 = 1))\,m(1) \\
&= \begin{cases}
-20\pi_0 & \text{if } 0 \le \pi_0 \le 1/3 \\
-20/3 & \text{if } 1/3 \le \pi_0 \le 2/3 \\
-20(1 - \pi_0) & \text{if } 2/3 \le \pi_0 \le 1.
\end{cases}
\end{aligned}
$$

Using the recursive relationship we derived in Section 15.3.3, we can derive the
utility of the optimal procedure in which no more than one observation is taken.
This is

$$
\begin{aligned}
V_1(\pi_0) &= \max\left\{V_0(\pi_0),\; E_{x_1}[V_0(\pi_{x^1})] - 1\right\} \\
&= \begin{cases}
-20\pi_0 & \text{if } 0 \le \pi_0 \le 23/60 \\
-23/3 & \text{if } 23/60 \le \pi_0 \le 37/60 \\
-20(1 - \pi_0) & \text{if } 37/60 \le \pi_0 \le 1.
\end{cases}
\end{aligned}
$$

If the prior is in the range $23/60 \le \pi_0 \le 37/60$, the observation of $x_1$ will potentially
move the posterior sufficiently to change the optimal decision, and to do so with
sufficient confidence to offset the cost of observation.

The calculation of $V_2$ follows the same steps and works out so that

$$
E_{x_1}[V_1(\pi_{x^1})] =
\begin{cases}
-20\pi_0 & \text{if } 0 \le \pi_0 \le 23/97 \\
-(83\pi_0 + 23)/9 & \text{if } 23/97 \le \pi_0 \le 37/83 \\
-20/3 & \text{if } 37/83 \le \pi_0 \le 46/83 \\
-(106 - 83\pi_0)/9 & \text{if } 46/83 \le \pi_0 \le 74/97 \\
-20(1 - \pi_0) & \text{if } 74/97 \le \pi_0 \le 1,
\end{cases}
$$

which leads to

$$
\begin{aligned}
V_2(\pi_0) &= \max\left\{V_0(\pi_0),\; E_{x_1}[V_1(\pi_{x^1})] - 1\right\} \\
&= \begin{cases}
-20\pi_0 & \text{if } 0 \le \pi_0 \le 32/97 \\
-(83\pi_0 + 32)/9 & \text{if } 32/97 \le \pi_0 \le 37/83 \\
-23/3 & \text{if } 37/83 \le \pi_0 \le 46/83 \\
-(115 - 83\pi_0)/9 & \text{if } 46/83 \le \pi_0 \le 65/97 \\
-20(1 - \pi_0) & \text{if } 65/97 \le \pi_0 \le 1.
\end{cases}
\end{aligned}
$$

Because of the availability of the second observation, the range of priors for which it
is optimal to start sampling is now broader, as shown in Figure 15.2.

To see how the solution works out in a specific case, suppose that $\pi_0 = 0.48$. With
this prior, a decision with no data is a close call, so we expect the data to be useful.
In fact, $V_0(\pi_0) = -9.60 < V_2(\pi_0) = -7.67$, so we should observe $x_1$. If $x_1 = 0$, we
have $V_0(\pi_{x^1}) = V_1(\pi_{x^1}) = -7.03$. Therefore, it is optimal to stop and choose $a_0$. If
$x_1 = 1$, $V_0(\pi_{x^1}) = V_1(\pi_{x^1}) = -6.32$ and, again, it is optimal to stop, but we choose $a_1$.
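
The backward recursion used in this example is short enough to check by machine. The following Python sketch is an illustration under stated assumptions, not part of the original treatment: the function names (`likelihoods`, `v0`, `predictive`, `update`, `v`) are hypothetical, and exact rational arithmetic through `fractions.Fraction` is simply a convenient choice. It evaluates $V_j$ for the 0-20 terminal utility with sampling cost $c = 1$.

```python
from fractions import Fraction as F

LOSS = 20  # terminal loss for an incorrect decision (Table 15.1)
COST = 1   # sampling cost c per observation

def likelihoods(x):
    """Return (f(x | theta = 1/3), f(x | theta = 2/3)) for one Bernoulli draw x."""
    return (F(1, 3), F(2, 3)) if x == 1 else (F(2, 3), F(1, 3))

def v0(p):
    """Value of deciding immediately when Pr(theta = 1/3) = p."""
    return -LOSS * min(p, 1 - p)

def predictive(p, x):
    """Predictive probability that the next observation equals x."""
    f0, f1 = likelihoods(x)
    return p * f0 + (1 - p) * f1

def update(p, x):
    """Posterior Pr(theta = 1/3) after observing x."""
    f0, _ = likelihoods(x)
    return p * f0 / predictive(p, x)

def v(j, p):
    """V_j(p): value when at most j further observations may be taken."""
    stop = v0(p)
    if j == 0:
        return stop
    cont = sum(predictive(p, x) * v(j - 1, update(p, x)) for x in (0, 1)) - COST
    return max(stop, cont)

prior = F(12, 25)  # pi_0 = 0.48, the case worked out in the text
print(float(v0(prior)), float(v(2, prior)))
for x in (0, 1):
    post = update(prior, x)
    print(x, float(v0(post)), float(v(1, post)))
```

With the prior set to 0.48, the printed values can be compared against the figures quoted in the text; the same function can be evaluated on a grid of priors to trace curves like those in Figure 15.2.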

15.4.2 An example with equivalence between sequential and 
fixed sample size designs 

This is another example from DeGroot (1970), and it shows that in some cases there 
is no gain from having the ability to accrue observations sequentially. 

Suppose that $x_1, x_2, \ldots$ is a sequential sample from a normal distribution with
unknown mean $\theta$ and specified variance $1/\omega$. The parameter $\omega$ is called the precision
and, as we will see, is a more convenient way to handle this problem. Assume that
$\theta$ is estimated under utility $u(a(\theta)) = -(\theta - a)^2$ and that there is a fixed cost $c$ per
observation. Also, suppose that the prior distribution $\pi$ of $\theta$ is a normal distribution
with mean $\mu$ and precision $h$. Then the posterior distribution of $\theta$ given $x^n$ is normal
with posterior precision $h + n\omega$. Therefore, there is no uncertainty about future
utilities, as

$$W_0(\pi_{x^n}, n) = -\mathrm{Var}[\theta|x^n] - nc = -\frac{1}{h + n\omega} - nc$$

and so $V_0(\pi_{x^n}) = -1/(h + n\omega)$ for $n = 0, 1, 2, \ldots$. This result shows that the expected
utility depends only on the number of observations taken and not on their observed
values. This implies that the optimal sequential decision procedure is a procedure in
which a fixed number of observations is taken, because from the beginning we can
predict exactly what our state of information will be at any stage in the process. Try
Problem 15.3 for a more general version of this result.

Exploring this example a bit more, we discover that in this simple case we
can solve the infinite horizon version of the dynamic programming solution to the
optimal stopping rule. From our familiar recursion equation

$$
\begin{aligned}
V_1(\pi_{x^n}) &= \max\left\{V_0(\pi_{x^n}),\; E_{x_{n+1}|x^n}[V_0(\pi_{x^{n+1}})] - c\right\} \\
&= \max\left\{-\frac{1}{h + n\omega},\; -\frac{1}{h + (n + 1)\omega} - c\right\}.
\end{aligned}
$$

Now

$$\frac{1}{h + n\omega} < \frac{1}{h + (n + 1)\omega} + c$$

whenever

$$\omega < c(h + n\omega)(h + (n + 1)\omega).$$

Using this fact, and an inductive argument (Problem 15.4), one can show
that $W_j(\pi, n) = -1/(h + n\omega) - nc$ for $\beta_n < \omega \le \beta_{n+1}$ where $\beta_n$ is defined as
$\beta_0 = 0$, $\beta_n = c(h + (n - 1)\omega)(h + n\omega)$ for $n = 1, \ldots, j$, and $\beta_{j+1} = \infty$. This implies
that the optimal sequential decision is to take exactly $n^*$ observations, where $n^*$ is an
integer such that $\beta_{n^*} < \omega \le \beta_{n^*+1}$. If the decision maker is not allowed to take more
than $k$ observations, then if $n^* > k$, he or she should take $k$ observations.
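
Because the expected utility of stopping after $n$ observations, $-1/(h + n\omega) - nc$, does not depend on the data, the optimal number of observations can be found by a direct search. The sketch below is illustrative only: the function names and the numerical values of $h$, $\omega$, and $c$ are assumptions chosen for the example, not values from the text. It relies on the fact that the gain from one more observation shrinks as $n$ grows, so the search can stop at the first $n$ for which an extra observation no longer pays for itself.

```python
def net_utility(n, h, omega, c):
    """Expected utility of taking exactly n observations: -1/(h + n*omega) - n*c."""
    return -1.0 / (h + n * omega) - n * c

def optimal_sample_size(h, omega, c, k=None):
    """Smallest n at which one more observation does not improve expected utility.

    If k is given, at most k observations are allowed (the k-truncated problem).
    """
    if c <= 0 or h <= 0 or omega <= 0:
        raise ValueError("h, omega and c must be positive")
    n = 0
    while (k is None or n < k) and \
            net_utility(n + 1, h, omega, c) > net_utility(n, h, omega, c):
        n += 1
    return n

# Hypothetical inputs: prior precision 1, data precision 1, cost 0.01 per observation.
print(optimal_sample_size(1.0, 1.0, 0.01))        # unrestricted problem
print(optimal_sample_size(1.0, 1.0, 0.01, k=5))   # at most five observations
```

The truncated call illustrates the last sentence above: when the unrestricted optimum exceeds $k$, the search stops at $k$.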


15.5 Sequential sampling to reduce uncertainty 

In this section we discuss sampling rules that do not involve statements of costs
of experimentation, following Lindley (1956) and DeGroot (1962). The decision
maker's goal is to obtain enough information about the parameters of interest.
Experimentation stops when enough information is reached. Specifically, let $V_x(\mathcal{E})$ denote
the observed information provided by observing $x$ in experiment $\mathcal{E}$ as defined in
Chapter 13. Sampling would continue until $V_{x^n}(\mathcal{E}) \ge 1/\epsilon$, with $\epsilon$ chosen before
experimentation starts. In estimation under squared error loss, for example, this
would mean sampling until $\mathrm{Var}[\theta|x^n] < \epsilon$ for some $\epsilon$.

Example 15.1 Suppose that $\Theta = \{\theta_0, \theta_1\}$ and let $\pi = \pi(\theta_0)$ and $\pi_{x^n} = \pi_{x^n}(\theta_0)$.
Consider the following sequential procedure: sampling continues until the Shannon
information of $\pi_{x^n}$ is at least $\epsilon$. In symbols, sampling continues until

$$\pi_{x^n} \log(\pi_{x^n}) + (1 - \pi_{x^n}) \log(1 - \pi_{x^n}) \ge \epsilon. \tag{15.7}$$

For more details on Shannon information and entropy see Cover and Thomas (1991).
One can show that Shannon's information is convex in $\pi_{x^n}$. Thus, inequality (15.7) is
equivalent to sampling as long as

$$A' \le \pi_{x^n} \le B', \tag{15.8}$$

with $A'$ and $B'$ satisfying the equality in (15.7). If $\pi_{x^n} < A'$, then we stop and accept
$H_1$; if $\pi_{x^n} > B'$ we stop and accept $H_0$.

Rewriting the posterior in terms of the prior and likelihood,

$$\pi_{x^n} = \frac{\pi f(x^n|\theta_0)}{\pi f(x^n|\theta_0) + (1 - \pi) f(x^n|\theta_1)}
= \left[1 + \frac{1 - \pi}{\pi}\,\frac{f(x^n|\theta_1)}{f(x^n|\theta_0)}\right]^{-1}. \tag{15.9}$$

This implies that inequality (15.8) is equivalent to

$$A \le \frac{f(x^n|\theta_1)}{f(x^n|\theta_0)} \le B, \tag{15.10}$$

with

$$A = \frac{\pi}{1 - \pi}\,\frac{1 - B'}{B'} \quad \text{and} \quad B = \frac{\pi}{1 - \pi}\,\frac{1 - A'}{A'}.$$

The procedure just derived is of the same form as Wald's sequential probability
ratio test (Wald 1947b), though in Wald the boundaries for stopping are determined
using frequentist operating characteristics of the algorithm. ★


Example 15.2 Let $x_1, x_2, \ldots$ be sequential observations from a Bernoulli distribution
with unknown probability of success $\theta$. We are interested in estimating $\theta$ under
the quadratic loss function. Suppose that $\theta$ is a priori $\mathrm{Beta}(\alpha_0, \beta_0)$, and let $\alpha_n$ and
$\beta_n$ denote the parameters of the beta posterior distribution. The optimal estimate of
$\theta$ is the posterior mean, that is $\delta^* = \alpha_n/(\alpha_n + \beta_n)$. The associated Bayes risk is the
posterior variance $\mathrm{Var}[\theta|x^n] = \alpha_n\beta_n/[(\alpha_n + \beta_n)^2(\alpha_n + \beta_n + 1)]$.

Figure 15.3 shows the contour plot of the Bayes risk as a function of $\alpha_n$ and
$\beta_n$. Say the sampling rule is to take observations as long as $\mathrm{Var}[\theta|x^n] > \epsilon = 0.02$.
Starting with a uniform prior ($\alpha = \beta = 1$), the prior variance is approximately
equal to 0.08. The segments in Figure 15.3 show the trajectory of the Bayes risk
after observing a particular sample. The first observation is a success and thus
$\alpha_n = 2$, $\beta_n = 1$, so our trajectory moves one step to the right. After $n = 9$ observations
with $x = (1,0,0,1,1,1,0,1,1)$ the decision maker crosses the 0.02 curve, and stops
experimentation.

Figure 15.3 Contour plot of the Bayes risk as a function of the beta posterior
parameters $\alpha_n$ (labeled a) and $\beta_n$ (labeled b).

For a more detailed discussion of this example, including the analysis
of the problem using asymptotic approximations, see Lindley (1956) and
Lindley (1957). ★
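
The stopping rule of Example 15.2 can be traced step by step. The following Python sketch is an illustration, with hypothetical function names: it updates the beta parameters one observation at a time and stops as soon as the posterior variance is no longer above the threshold.

```python
def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def sample_until_precise(xs, a0=1, b0=1, eps=0.02):
    """Update Beta(a0, b0) with Bernoulli data xs while the variance exceeds eps.

    Returns (n, a, b): the number of observations used and the final parameters,
    or (None, a, b) if the data run out before the threshold is crossed.
    """
    a, b = a0, b0
    if beta_variance(a, b) <= eps:
        return 0, a, b
    for n, x in enumerate(xs, start=1):
        a, b = a + x, b + (1 - x)   # a success raises a, a failure raises b
        if beta_variance(a, b) <= eps:
            return n, a, b
    return None, a, b

n, a, b = sample_until_precise([1, 0, 0, 1, 1, 1, 0, 1, 1])
print(n, a, b, round(beta_variance(a, b), 4))
```

Run on the sample quoted in the example, the rule stops at the ninth observation, in agreement with the trajectory described for Figure 15.3.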

Example 15.3 Suppose that $x_1, x_2, \ldots$ are sequentially drawn from a normal distribution
with unknown mean $\theta$ and known variance $\sigma^2$. Suppose that a priori $\theta$ has
$N(0, 1)$ distribution. The posterior variance $\mathrm{Var}[\theta|x^n] = \sigma^2/(n + \sigma^2)$ is independent
of $x^n$. Let $0 < \epsilon < 1$. Sampling until $\mathrm{Var}[\theta|x^n] < \epsilon$ is equivalent to taking a fixed
sample size with $n > \sigma^2(1 - \epsilon)/\epsilon$.
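
As a quick numerical check of Example 15.3, the short sketch below (hypothetical function name; the values of $\sigma^2$ and $\epsilon$ are assumptions chosen for illustration) finds the first $n$ at which the posterior variance $\sigma^2/(n + \sigma^2)$ falls below $\epsilon$, which can then be compared with the bound $\sigma^2(1 - \epsilon)/\epsilon$.

```python
def fixed_sample_size(sigma2, eps):
    """Smallest n with sigma2 / (n + sigma2) < eps, under a N(0, 1) prior."""
    if not (0 < eps < 1) or sigma2 <= 0:
        raise ValueError("need 0 < eps < 1 and sigma2 > 0")
    n = 0
    while sigma2 / (n + sigma2) >= eps:   # keep sampling until variance < eps
        n += 1
    return n

sigma2, eps = 4.0, 0.25                   # illustrative values only
n = fixed_sample_size(sigma2, eps)
print(n, sigma2 * (1 - eps) / eps)        # n is the first integer above the bound
```

No data enter the calculation, which is the point of the example: the sample size is known before any observation is taken.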

15.6 The stopping rule principle 

15.6.1 Stopping rules and the Likelihood Principle 

In this section we return to our discussion of foundations and elaborate on
Theorem 15.1, which states that the Bayesian optimal decision is not affected by the
stopping rule. The reason for this result is a general factorization of the likelihood: for
any stopping rule $\zeta$ for sampling from a sequence of observations $x_1, x_2, \ldots$ having
fixed sample size parametric model $f(x^n|n, \theta) = f(x^n|\theta)$, the likelihood function is

$$f(n, x^n|\theta) = \zeta_n(x^n) \prod_{i=1}^{n-1} \left(1 - \zeta_i(x^i)\right) f(x^n|\theta) \propto f(x^n|\theta), \quad \theta \in \Theta, \tag{15.11}$$

for all $(n, x^n)$ such that $f(n, x^n|\theta) \ne 0$.

Thus the likelihood function is the same irrespective of the stopping rule. The
stopping time provides no additional information about $\theta$ to that already contained
in the likelihood function $f(x^n|\theta)$ or in the prior $\pi(\theta)$, a fact referred to as the
stopping rule principle (Raiffa and Schlaifer 1961). If one uses any statistical procedure
satisfying the Likelihood Principle, the stopping rule should have no effect on the
final reported evidence about $\theta$ (Berger and Wolpert 1988):

The stopping rule principle does not say that one can ignore the stopping
rule and then use any desired measure of evidence. It does say that
reasonable measures of evidence should be such that they do not depend on
the stopping rule. Since frequentist measures do depend on the stopping
rule, they do not meet this criterion of reasonableness. (Berger and Berry
1988, p. 45)

Some of the difficulties faced by frequentist analysis will be illustrated with
the next example (Berger and Wolpert 1988, pp. 74.1-74.2). A scientist comes
to the statistician's office with 100 observations that are assumed to be independent
and identically distributed $N(\theta, 1)$. The scientist wants to test the hypothesis
$H_0: \theta = 0$ versus $H_1: \theta \ne 0$. The sample mean is $\bar{x}_n = 0.2$, so that the test statistic is
$z = \sqrt{n}\,|\bar{x}_n - 0| = 2$. At the level 0.05, a classical statistician could conclude that there
is significant evidence against the null hypothesis.

Suppose that a more careful classical statistician asks
the scientist the reason for stopping experimentation after 100 observations. If the
scientist says that she decided to take only a batch of 100 observations, then the
statistician could claim significance at the level 0.05. Is that right? From a classical
perspective, another important question would be the scientist's attitude towards a
non-significant result. Suppose that the scientist says that she would take another
batch of 100 observations, thus considering a rule of the form:

(i) Take 100 observations.

(ii) If $\sqrt{100}\,|\bar{x}_{100}| \ge k$ then stop and reject the null hypothesis.

(iii) If $\sqrt{100}\,|\bar{x}_{100}| < k$ then take another 100 observations and reject the null
hypothesis if $\sqrt{200}\,|\bar{x}_{200}| \ge k$.

This procedure would have level $\alpha = 0.05$, with $k = 2.18$. Since $\sqrt{100}\,|\bar{x}_{100}| = 2 <
2.18$, the scientist could not reject the null hypothesis and would have to take
an additional 100 observations.
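
The level of the two-stage rule can be checked numerically. The sketch below is an illustration with hypothetical function names, not the calculation used in the source. Under $H_0$ the first-stage statistic $z_1 = \sqrt{100}\,\bar{x}_{100}$ is standard normal, and the second-stage statistic can be written as $(z_1 + z_1')/\sqrt{2}$, where $z_1'$ is the independent standard normal statistic of the second batch. The rejection probability is then the first-stage tail probability plus an integral over the continuation region, evaluated here with a simple midpoint rule.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def two_stage_level(k, steps=4000):
    """Probability, under the null, that the two-stage rule rejects."""
    reject_stage1 = 2.0 * (1.0 - norm_cdf(k))      # |z1| >= k
    root2 = math.sqrt(2.0)
    width = 2.0 * k / steps
    reject_stage2 = 0.0
    for i in range(steps):
        z1 = -k + (i + 0.5) * width                # midpoint of each slice of (-k, k)
        # Given z1, no rejection at stage 2 iff -k*sqrt(2) - z1 < z1' < k*sqrt(2) - z1
        accept = norm_cdf(k * root2 - z1) - norm_cdf(-k * root2 - z1)
        reject_stage2 += norm_pdf(z1) * (1.0 - accept) * width
    return reject_stage1 + reject_stage2

print(round(two_stage_level(2.18), 4))   # overall level with k = 2.18
print(round(two_stage_level(1.96), 4))   # level if the fixed-sample cutoff were reused
```

The second call shows why the cutoff has to be raised: reusing the fixed-sample value 1.96 at both looks inflates the overall level well above 0.05.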

This example shows that when using the frequentist approach, the interpretation 
of the results of an experiment depends not only on the data obtained and the way 
they were obtained, but also on the experimenter’s intentions. Berger and Wolpert 
comment on the problems faced by frequentist sequential methods: 

Optional stopping poses a significant problem for classical statistics,
even when the experimenters are extremely scrupulous. Honest frequentists
face the problem of getting extremely convincing data too soon (i.e.,
before their stopping rule says to stop), and then facing the dilemma of
honestly finishing the experiment, even though a waste of time or
dangerous to subjects, or of stopping the experiment with the prematurely
convincing evidence and then not being able to give frequency measures
of evidence. (Berger and Wolpert 1988, pp. 77-78)


15.6.2 Sampling to a foregone conclusion 

One of the reasons for the reluctance of frequentist statisticians to accept the
stopping rule principle is the concern that when the stopping rule is ignored, investigators
using frequentist measures of evidence could be allowed to reach any conclusion
they like by sampling until their favorite hypothesis receives enough evidence. To
see how that happens, suppose that $x_1, x_2, \ldots$ are sequentially drawn, normally
distributed random variables with mean $\theta$ and unit variance. An investigator is interested
in disproving the null hypothesis that $\theta = 0$. She collects the data sequentially, by
performing a fixed sample size test, and stopping when the departure from the null
hypothesis is significant at some prespecified level $\alpha$. In symbols, this means
continue sampling until $|\bar{x}_n| > k_\alpha/\sqrt{n}$, where $\bar{x}_n$ denotes the sample mean and $k_\alpha$ is
chosen so that the level of the test is $\alpha$. This stopping rule can be proved to be proper.
Thus, almost surely she will reject the null hypothesis. This may take a long time, but
it is guaranteed to happen. We refer to this as "sampling to a foregone conclusion."
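
A small simulation gives a feel for how quickly this happens. The sketch below is illustrative only: the function names, the cutoff 1.96, the cap on the number of observations, the number of runs, and the seed are all assumptions made for the example. Data are generated under the null hypothesis, and the investigator stops the first time $|\bar{x}_n| > k_\alpha/\sqrt{n}$.

```python
import math
import random

def stopping_time(rng, k_alpha=1.96, max_n=1000):
    """First n <= max_n with |mean| > k_alpha / sqrt(n), or None if never reached."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)          # data drawn under the null, theta = 0
        if abs(total / n) > k_alpha / math.sqrt(n):
            return n
    return None

def fraction_rejecting(runs=200, max_n=1000, seed=0):
    """Share of simulated investigators who reject the null within max_n draws."""
    rng = random.Random(seed)
    stopped = sum(stopping_time(rng, max_n=max_n) is not None for _ in range(runs))
    return stopped / runs

for cap in (1, 10, 100, 1000):
    print(cap, fraction_rejecting(max_n=cap))
```

With a cap of one observation the rejection rate is close to the nominal level; as the cap grows, the share of runs that reject a true null keeps rising, which is the phenomenon described above.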

Kadane et al. (1996) examine the question of whether the expected utility theory, 
which leads to not using the stopping rule in reaching a terminal decision, is also 
prone to sampling to a foregone conclusion. They frame the question as follows: can 
we find, within the expected utility paradigm, a stopping rule such that, given the 
experimental data, the posterior probability of the hypothesis of interest is necessarily 
greater than its prior probability? 

One thing they discover is that sampling to foregone conclusions is indeed 
possible when using improper priors that are finitely, but not countably, additive 
(Billingsley 1995). This means trouble for all the decision rules that, like minimax, 
are Bayes only when one takes limits of priors: 

Of course, there is a... perspective which also avoids foregone conclusions.
This is to prescribe the use of merely finitely additive probabilities
altogether. The cost here would be an inability to use improper priors.
These have been found to be useful for various purposes, including
reconstructing some basic "classical" inferences, affording "minimax"
solutions in statistical decisions when the parameter space is infinite,
approximating "ignorance" when the improper distribution is a limit of
natural conjugate priors, and modeling what appear to be natural states
of belief. (Kadane et al. 1996, p. 1235)

By contrast, though, if all distributions involved are countably additive, then, as
you may remember from Chapter 10, the posterior distribution is a martingale, so a
priori we expect it to be stable. Consider any real-valued function $h$. Assuming that
the expectations involved exist, it follows from the law of total probability that

$$E_\theta[h(\theta)] = E_x\left[E_\theta[h(\theta)|x]\right].$$

So there can be no experiment designed to drive up or drive down for sure the
conditional expectation of $h$, given $x$. Let $h = I_{\{\theta \in \Theta_0\}}$ denote the indicator that $\theta$ belongs
to $\Theta_0$, the subspace defined by some hypothesis $H_0$, and let $E_\theta[h(\theta)] = \pi(H_0) = \pi_0$.
Suppose that $x_1, x_2, \ldots$ are sequentially observed. Consider a design with a minimum
sample of size $k \ge 0$ and define the stopping time $N = \inf\{n \ge k : \pi(H_0|x^n) \ge q\}$,
with $N = \infty$ if the set is empty. This means that sampling stops when the posterior
probability is at least $q$. Assume that $q > \pi_0$. Then

$$
\begin{aligned}
\pi_0 = \pi(H_0) &\ge \pi(H_0, N < \infty) \\
&= \sum_{n=k}^{\infty} P(N = n)\, \pi(H_0|N = n) \\
&\ge \sum_{n=k}^{\infty} P(N = n) \int_{x^n : N = n} q \, dF(x^n|n) \\
&= q\, P(N < \infty),
\end{aligned}
$$

where $F(x^n|n)$ is the cumulative distribution of $x^n$ given $N = n$. From the inequality
above, $P(N < \infty) \le \pi_0/q < 1$. Then, with probability less than 1, a Bayesian
stops the sequence of experiments and concludes that the posterior probability of
hypothesis $H_0$ is at least $q$. Based on this result, a foregone conclusion cannot be
reached for sure.

Now that we have reached the conclusion that our Bayesian journey could not be 
rigged to reach a foregone conclusion, we may stop. 


15.7 Exercises 

Problem 15.1 (Berger 1985) This is an application of optimal stopping in an
estimation problem. Assume that $x_1, x_2, \ldots$ is a sequential sample from a Bernoulli
distribution with parameter $\theta$. We want to estimate $\theta$ when the utility is $u(\theta, a, n) =
-(\theta - a)^2 - nc$. Assume a priori that $\theta$ has a uniform distribution on (0,1). The cost
of sampling is $c = 0.01$. Find the Bayes three-truncated procedure $d^3$.


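Solution

First, we will state without proof some basic results:

(i) The posterior distribution of $\theta$ given $x^n$ is a
$\mathrm{Beta}\left(\sum_{i=1}^{n} x_i + 1,\; n - \sum_{i=1}^{n} x_i + 1\right)$.

(ii) Let $y_n = \sum_{i=1}^{n} x_i$. The marginal distribution of $x^n$ is given by

$$m(x^n) = \frac{\Gamma(y_n + 1)\,\Gamma(n - y_n + 1)}{\Gamma(n + 2)}.$$

(iii) The predictive distribution of $x_{n+1}$ given $x^n$ is

$$
m(x_{n+1}|x^n) = \frac{m(x^{n+1})}{m(x^n)} =
\begin{cases}
\dfrac{n + 1 - y_n}{n + 2} & \text{if } x_{n+1} = 0, \\[2ex]
\dfrac{y_n + 1}{n + 2} & \text{if } x_{n+1} = 1.
\end{cases}
$$

Now, since the terminal utility is $u(a(\theta)) = -(\theta - a)^2$, the optimal action $a^*$ is
the posterior mean, and the posterior expected utility is minus the posterior variance.
Therefore,

$$W_0(\pi_{x^n}, n) = \sup_a \int_\Theta \left(u(a(\theta)) - C(n)\right) \pi_{x^n}(\theta)\, d\theta
= -\mathrm{Var}[\theta|x^n] - 0.01n.$$

Using this result,

$$
\begin{aligned}
W_0(\pi, 0) &= -\mathrm{Var}[\theta] = -1/12 \approx -0.0833, \\
W_0(\pi_{x^1}, 1) &= -\mathrm{Var}[\theta|x^1] - 0.01 = -2/36 - 0.01 = -0.0655, \\
W_0(\pi_{x^2}, 2) &= \begin{cases} -0.0700 & \text{if } y_2 = 1 \\ -0.0575 & \text{if } y_2 = 0 \text{ or } y_2 = 2, \end{cases} \\
W_0(\pi_{x^3}, 3) &= \begin{cases} -0.0700 & \text{if } y_3 = 1 \text{ or } y_3 = 2 \\ -0.0567 & \text{if } y_3 = 0 \text{ or } y_3 = 3. \end{cases}
\end{aligned}
$$

Observe that

$$W_j(\pi_{x^n}, n) = \max\left\{W_0(\pi_{x^n}, n),\; E_{x_{n+1}|x^n}[W_{j-1}(\pi_{x^{n+1}}, n + 1)]\right\}.$$

Note that

$$
\begin{aligned}
E_{x_1}[\mathrm{Var}[\theta|x^1]] &= m(0)\,\mathrm{Var}[\theta|x_1 = 0] + m(1)\,\mathrm{Var}[\theta|x_1 = 1] \\
&= (1/2)(2/36) + (1/2)(2/36) = 2/36 = 1/18, \\
E_{x_2|x^1}[\mathrm{Var}[\theta|x^2]] &= m(0|x_1)\,\mathrm{Var}[\theta|x_1, x_2 = 0] + m(1|x_1)\,\mathrm{Var}[\theta|x_1, x_2 = 1] \\
&= \begin{cases} (2/3)(3/80) + (1/3)(4/80) = 1/24 & \text{if } y_1 = 0 \\ (1/3)(4/80) + (2/3)(3/80) = 1/24 & \text{if } y_1 = 1, \end{cases}
\end{aligned}
$$

and, similarly,

$$
E_{x_3|x^2}[\mathrm{Var}[\theta|x^3]] =
\begin{cases} 18/600 & \text{if } y_2 = 0 \text{ or } y_2 = 2 \\ 6/150 & \text{if } y_2 = 1. \end{cases}
$$

Thus,

$$
\begin{aligned}
W_1(\pi, 0) &= \max\{W_0(\pi, 0),\; E_{x_1}[W_0(\pi_{x^1}, 1)]\} \\
&= \max\{-\mathrm{Var}(\theta),\; -E_{x_1}[\mathrm{Var}[\theta|x^1]] - c\} \\
&= \max\{-1/12,\; -(1/18 + 0.01)\} \approx -0.0655, \\
W_1(\pi_{x^1}, 1) &= \max\{W_0(\pi_{x^1}, 1),\; E_{x_2|x^1}[W_0(\pi_{x^2}, 2)]\} \\
&= \max\{W_0(\pi_{x^1}, 1),\; -E_{x_2|x^1}[\mathrm{Var}[\theta|x_1, x_2]] - 2 \times c\} \\
&= \max\{-0.0655,\; -0.0617\} = -0.0617, \\
W_1(\pi_{x^2}, 2) &= \max\{W_0(\pi_{x^2}, 2),\; E_{x_3|x^2}[W_0(\pi_{x^3}, 3)]\} \\
&= \max\{W_0(\pi_{x^2}, 2),\; -E_{x_3|x^2}[\mathrm{Var}[\theta|x^2, x_3]] - 3 \times c\} \\
&= \begin{cases} \max\{-0.0575,\; -0.0600\} = -0.0575 & \text{if } y_2 = 0 \text{ or } y_2 = 2 \\ \max\{-0.0700,\; -0.0700\} = -0.0700 & \text{if } y_2 = 1. \end{cases}
\end{aligned}
$$

Proceeding similarly, we find

$$
\begin{aligned}
W_2(\pi, 0) &= \max\{W_0(\pi, 0),\; E_{x_1}[W_1(\pi_{x^1}, 1)]\} = -0.0617, \\
W_2(\pi_{x^1}, 1) &= \max\{W_0(\pi_{x^1}, 1),\; E_{x_2|x^1}[W_1(\pi_{x^2}, 2)]\} = -0.0617, \\
W_3(\pi, 0) &= \max\{W_0(\pi, 0),\; E_{x_1}[W_2(\pi_{x^1}, 1)]\} = -0.0617.
\end{aligned}
$$

We can now describe the optimal three-truncated Bayes procedure. Since at stage
0, $W_0(\pi, 0) = -0.083 < W_3(\pi, 0) = -0.0617$, $x_1$ should be observed. After observing
$x_1$, as $W_0(\pi_{x^1}, 1) = -0.0655 < W_2(\pi_{x^1}, 1) = -0.0617$, we should observe $x_2$. After
observing it, since we have $W_0(\pi_{x^2}, 2) = W_1(\pi_{x^2}, 2)$, it is optimal to stop. The Bayes
action is

$$
\delta_2^*(x^2) =
\begin{cases} 1/4 & \text{if } y_2 = 0 \\ 1/2 & \text{if } y_2 = 1 \\ 3/4 & \text{if } y_2 = 2. \end{cases}
$$

□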
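
The arithmetic in this solution can be reproduced with a few lines of backward induction. The Python sketch below is an illustration with hypothetical function names; it uses exact fractions, indexes the state by the stage $n$ and the number of successes $y_n$, and folds the sampling cost into $W_0$ exactly as in the solution.

```python
from fractions import Fraction
from functools import lru_cache

C = Fraction(1, 100)   # sampling cost c = 0.01

def post_var(n, y):
    """Posterior variance of theta: Beta(y + 1, n - y + 1) under the uniform prior."""
    a, b = y + 1, n - y + 1
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

def w0(n, y):
    """W_0 at stage n: stop now, paying n * c for the observations taken so far."""
    return -post_var(n, y) - n * C

@lru_cache(maxsize=None)
def w(j, n, y):
    """W_j at stage n with y successes: at most j more observations are allowed."""
    stop = w0(n, y)
    if j == 0:
        return stop
    p1 = Fraction(y + 1, n + 2)            # predictive probability of a success
    cont = (1 - p1) * w(j - 1, n + 1, y) + p1 * w(j - 1, n + 1, y + 1)
    return max(stop, cont)

print(float(w0(0, 0)), float(w(3, 0, 0)))   # stage 0: stop vs. three-truncated value
print(float(w0(1, 0)), float(w(2, 1, 0)))   # stage 1 after a failure
print([w(1, 2, y) == w0(2, y) for y in (0, 1, 2)])  # stage 2: stopping is optimal
```

Exact fractions avoid the rounding in the four-decimal figures above; for instance the three-truncated value at stage 0 is $-37/600 \approx -0.0617$.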


Problem 15.2 Sequential problems get very hard very quickly. This is one of the
simplest you can try, but it will take some time. Try to address the same question
as the worked example in Problem 15.1, except with Poisson observations and a
Gamma(1,1) prior. Set $c = 1/12$.

Problem 15.3 (From DeGroot 1970) Let $\pi$ be the prior distribution of $\theta$ in a
sequential decision problem. (a) Prove by using an inductive argument that, in a
sequential decision problem with fixed cost per observation, if $V_0(\pi_{x^n})$ is constant for
all observed values $x^n$, then every function $V_j$ for $j = 1, 2, \ldots$ as well as $V_0$ has this
property. (b) Conclude that the optimal sequential decision procedure is a procedure
in which a fixed number of observations are taken.

Problem 15.4 (From DeGroot 1970) (a) In the context of Example 15.4.2, for
$n = 0, 1, \ldots, j$, prove that

$$W_j(\pi, n) = -\frac{1}{h + n\omega} - nc \quad \text{for } \beta_n < \omega \le \beta_{n+1}$$

where $\beta_n$ is defined as $\beta_0 = 0$, $\beta_n = c(h + (n - 1)\omega)(h + n\omega)$ for $n = 1, \ldots, j$ and
$\beta_{j+1} = \infty$. (b) Conclude that if the decision maker is allowed to take no more than $k$
observations, then the optimal procedure is to take $n^* \le k$ observations where $n^*$ is
such that $\beta_{n^*} < \omega \le \beta_{n^*+1}$.
Notation    Explanation    Chapter

$\succ, \prec$    preference relations    2
$\sim$    equivalence relation    2
$\mathcal{Z}$    set of outcomes, rewards    3
$z$    outcome, generic element of $\mathcal{Z}$    3
$z^0$    best outcome    3
$z_0$    worst outcome    3
Q-.-isOk    outcome from actions a-J*, ... when the state of nature is 8^    12
$z_{i_1 i_2 \ldots i_S k}$    outcome from actions a™, ..., a-® when the state of nature is 9^ k    12
$\Theta$    set of states of nature, parameter space    3
$\theta$    event, state of nature, parameter, generic element of $\Theta$, scalar    2
$\boldsymbol{\theta}$    state of nature, parameter, generic element of $\Theta$, vector    7
M s )    generic state of nature upon taking stopping action at stage $s$    12
MS)    generic state of nature at last stage $S$ upon taking action    12
$\pi$    subjective probability    2
$\pi_\theta$    price, relative to stake $S$, for a bet on $\theta$    2
$\pi_{\theta_1|\theta_2}$    price of a bet on $\theta_1$ called off if $\theta_2$ does not occur    2
<    current probability assessment of $\theta$    2
K    future probability assessment of $\theta$    2
$\pi^M$    least favorable prior    7
$S_\theta$    stake associated with event $\theta$    2
$g_\theta$    net gains if event $\theta$ occurs    2
$\mathcal{A}$    action space    3
$a$    action, generic element of $\mathcal{A}$    3
$a(\theta)$    VNM action, function from states to outcomes    3
$\bar{z}$    expected value of the rewards for a lottery $a$    4
$z^*(a)$    certainty equivalent of lottery $a$    4
$\lambda(z)$    Arrow-Pratt (local) measure of risk aversion at $z$    4
$\mathcal{P}$    set of probability functions on $\mathcal{Z}$    6
$a$    AA action, horse lottery, list of VNM lotteries $a(\theta)$ in $\mathcal{P}$ for $\theta \in \Theta$    6
$a(\theta, z)$    probability that lottery $a(\theta)$ assigns to outcome $z$    6
$a_{S,\theta}$    action of betting stake $S$ on event $\theta$    2
$a_\theta$    action that would maximize decision maker's utility if $\theta$ was known    13
$\mathbf{a}$    action, vector    7
$a_{i_s}^{(s)}$    generic action, indexed by $i_s$, at stage $s$ of a multistage decision problem    12
$a_0^{(s)}$    stopping action at stage $s$ of a multistage decision problem    12
$a^M$    minimax action    7
$a^*$    Bayes action    3
$p(z)$    probability of outcome $z$    3
$p$    lottery or gamble, probability distribution over $\mathcal{Z}$    3
$\chi_z$    degenerate action with mass 1 at reward $z$    3
$u$    utility    3
$u(z)$    utility of outcome $z$    3
$u(a(\theta))$    utility of outcome $a(\theta)$    3
$u_\theta(z)$    state-dependent utility of outcome $z$    6
$S(q)$    expected score of the forecast probability $q$    10
$s(\theta, q)$    scoring rule for the distribution $q$ and event $\theta$    10
$RP(a)$    risk premium associated with action $a$    4
$\mathcal{X}$    sample space    7
$x$    random variable or observed outcome    7
$x^n$    random sample $(x_1, \ldots, x_n)$    14
$\mathbf{x}$    multivariate random sample    7
$x_{i_s}^{(s)}$    random variable with possible values $x_{i_s 1}, \ldots, x_{i_s J_s}$ observed at stage $s$, upon taking continuation action $a_{i_s}^{(s)}$, $i_s > 0$    12
$f(x|\theta)$    probability density function of $x$ or likelihood function    7
$m(x)$    marginal density function of $x$    7
$F(x|\theta)$    distribution function of $x$    7
$\pi(\theta)$    prior probability of $\theta$; may indicate a density if $\theta$ is continuous    7
$\pi(\theta|x)$, $\pi_x$    posterior density of $\theta$ given $x$    7
$E_x[g(x)]$    expectation of the function $g(x)$ with respect to $m(x)$    7
$E_{x|\theta}[g(x)]$, $E[g(x)|\theta]$    expectation of the function $g(x)$ with respect to $f(x|\theta)$    7
$\mathcal{U}_\pi(a)$    expected utility of action $a$, using prior $\pi$    3
$\mathcal{U}_\pi(d)$    expected utility of the sequential decision procedure $d$    15
$\delta$    decision rule, function with domain $\mathcal{X}$ and range $\mathcal{A}$    7
$\boldsymbol{\delta}$    multivariate decision rule    7
$\delta^M$    minimax decision rule    7
$\delta^*$    Bayes decision rule    7
$\delta^R(x, \cdot)$    randomized decision rule for a given $x$    7
$\delta_n$    terminal decision rule after $n$ observations $x^n$    15
$\delta$    sequence of decision rules $\delta_1(x^1), \delta_2(x^2), \ldots$    15
$\zeta_n$    stopping rule after $n$ observations $x^n$    15
$\zeta$    sequence of stopping rules $\zeta_0, \zeta_1(x^1), \ldots$    15
$d$    sequential decision rule: pair $(\delta, \zeta)$    15
$L(\theta, a)$    loss function (in regret form)    7
$L_u(\theta, a)$    loss function as the negative of the utility function    7
$L(\theta, a, n)$    loss function for $n$ observations    14
$u(\theta, a, n)$    utility function for $n$ observations    14
$L(\theta, \delta^R(x))$    loss function for randomized decision rule $\delta^R$    7
$\mathcal{L}_\pi(a)$    prior expected loss for action $a$    7
$\mathcal{L}_{\pi_x}(a)$    posterior expected loss for action $a$    7
$\mathcal{U}(\theta, d)$    average utility of the sequential procedure $d$ for given $\theta$    15
$R(\theta, \delta)$    risk function of decision rule $\delta$    7
$V(\pi)$    maximum expected utility with respect to prior $\pi$    13
$W_0(\pi_{x^n}, n)$    posterior expected utility, at time $n$, of collecting 0 additional observations    15
$W_{k-n}(\pi_{x^n}, n)$    posterior expected utility, at time $n$, of continuing for at most an additional $k - n$ steps    15
$r(\pi, \delta)$    Bayes risk associated with prior $\pi$ and decision $\delta$    7
$r(\pi, n)$    Bayes risk adopting the optimal terminal decision with $n$ observations    14
$\mathcal{D}$    class of decision rules    7
$\mathcal{D}^R$    class of all (including randomized) decision rules    7
$\mathcal{E}^\infty$    perfect experiment    13
$\mathcal{E}$    generic statistical experiment    13
$\mathcal{E}_{12}$    statistical experiment consisting of observing both variables $x_1$ and $x_2$    13
$\mathcal{E}^{(n)}$    statistical experiment consisting of experiments $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n$ where $\mathcal{E}_2, \ldots, \mathcal{E}_n$ are conditionally i.i.d. repetitions of $\mathcal{E}_1$    13
$V_\theta(\mathcal{E}^\infty)$    conditional value of perfect information for a given $\theta$    13
$V(\mathcal{E}^\infty)$    expected value of perfect information    13
$V_x(\mathcal{E})$    observed value of information for a given $x$ in experiment $\mathcal{E}$    13
$V(\mathcal{E})$    expected value of information in the experiment $\mathcal{E}$    13
$V(\mathcal{E}_{12})$    expected value of information in the experiment $\mathcal{E}_{12}$    13
$V(\mathcal{E}_2|\mathcal{E}_1)$    expected information of $x_2$ conditional on observed $x_1$    13
$I_x(\mathcal{E})$    observed (Lindley) information provided by observing $x$ in experiment $\mathcal{E}$    13
$I(\mathcal{E})$    expected (Lindley) information provided by the experiment $\mathcal{E}$    13
$I(\mathcal{E}_{12})$    expected (Lindley) information provided by the experiment $\mathcal{E}_{12}$    13
$I(\mathcal{E}_2|\mathcal{E}_1)$    expected (Lindley) information provided by $\mathcal{E}_2$ conditional on $\mathcal{E}_1$    13
$c$    cost per observation    14
$C(n)$    cost function    14
$\delta_n^*$    Bayes rule based on $n$ observations    14
$\mathcal{U}_\pi(n)$    expected utility of making $n$ observations and adopting the optimal terminal decision    14
$n^*$    Bayesian optimal sample size    14
$n^M$    Minimax optimal sample size    14
9    nuisance parameter    7
$\Phi(\cdot)$    cumulative distribution function of the $N(0,1)$    8
$\mathcal{M}$    class of parametric models in a decision problem    11
$M$    generic model within $\mathcal{M}$    11
$\mathcal{H}_s$    history set at stage $s$ of a multistage decision problem    12
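A.2 Relations

$p(z) = $    (3.3)
$\bar{z} = \sum_z z\, p(z)$    (4.4)
$z^*(a) = u^{-1}\left(\sum_z u(z) p(z)\right)$    (4.9)
$RP(a) = \bar{z} - z^*(a)$    (4.10)
$\lambda(z) = -u''(z)/u'(z) = -(d/dz)(\log u'(z))$    (4.13)
$L_u(\theta, a) = -u(a(\theta))$    (7.1)
$L(\theta, a) = L_u(\theta, a) - \inf_{a \in \mathcal{A}} L_u(\theta, a)$    (7.2)
$\quad = \sup_{a'(\theta)} u(a'(\theta)) - u(a(\theta))$    (7.3)
$\mathcal{L}_\pi(a) = \int_\Theta L(\theta, a)\pi(\theta)\,d\theta$    (7.6)
$\mathcal{L}_{\pi_x}(a) = \int_\Theta L(\theta, a)\pi(\theta|x)\,d\theta$    (7.13)
$\mathcal{U}_\pi(a) = \int_\Theta u(a(\theta))\pi(\theta)\,d\theta$    (3.4)
$\quad = \sum_{z \in \mathcal{Z}} p(z)u(z)$    (3.5)
$S(q) = \sum_{j=1}^{J} s(\theta_j, q)\pi_j$    (10.1)
$V(\pi) = \sup_a \mathcal{U}_\pi(a)$    (13.2)
$a^M = \arg\min \max_\theta L(\theta, a)$    (7.4)
$a^* = \arg\max \mathcal{U}_\pi(a)$    (3.6)
$\quad = \arg\min \int L(\theta, a)\pi(\theta)\,d\theta$    (7.5)
$a_\theta = \arg\sup_{a \in \mathcal{A}} u(a(\theta))$    (13.4)
$m(x) = \int_\Theta \pi(\theta)f(x|\theta)\,d\theta$    (7.12)
$\pi(\theta|x) = \pi(\theta)f(x|\theta)/m(x)$    (7.11)
$L(\theta, \delta^R(x)) = E_{\delta^R(x)} L(\theta, a) = \int_{a \in \mathcal{A}} L(\theta, a)\delta^R(x, a)\,da$    (7.16)
$R(\theta, \delta) = \int_{\mathcal{X}} L(\theta, \delta)f(x|\theta)\,dx$    (7.7)
$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\pi(\theta)\,d\theta$    (7.9)
$\delta^M$ s.t. $\sup_\theta R(\theta, \delta^M) = \inf_\delta \sup_\theta R(\theta, \delta)$    (7.8)
$\delta^*$ s.t. $r(\pi, \delta^*) = \inf_\delta r(\pi, \delta)$    (7.10)
$V_\theta(\mathcal{E}^\infty) = u(a_\theta(\theta)) - u(a^*(\theta))$    (13.5)
$V(\mathcal{E}^\infty) = E_\theta[V_\theta(\mathcal{E}^\infty)]$    (13.6)
$\quad = E_\theta[\sup_a u(a(\theta))] - V(\pi)$    (13.7)
$V_x(\mathcal{E}) = V(\pi_x) - \mathcal{U}_{\pi_x}(a^*)$    (13.9)
$V(\mathcal{E}) = E_x[V_x(\mathcal{E})] = E_x[V(\pi_x) - \mathcal{U}_{\pi_x}(a^*)]$    (13.10)
$\quad = E_x[V(\pi_x)] - V(\pi)$    (13.12)
$\quad = E_x[V(\pi_x)] - V(E_x[\pi_x])$    (13.13)
$V(\mathcal{E}_2|\mathcal{E}_1) = E_{x_1 x_2}[V(\pi_{x_1 x_2})] - E_{x_1}[V(\pi_{x_1})]$    (13.17)
$V(\mathcal{E}_{12}) = E_{x_1 x_2}[V(\pi_{x_1 x_2})] - V(\pi)$    (13.16)
$\quad = V(\mathcal{E}_1) + V(\mathcal{E}_2|\mathcal{E}_1)$    (13.18)
$I(\mathcal{E}) = \int_{\mathcal{X}} \int_\Theta \log(\pi_x(\theta)/\pi(\theta))\, \pi_x(\theta) m(x)\,d\theta\,dx$    (13.27)
$\quad = \int_\Theta \int_{\mathcal{X}} \log(f(x|\theta)/m(x))\, f(x|\theta)\pi(\theta)\,dx\,d\theta$    (13.28)
$\quad = E_x\left[E_{\theta|x}\left[\log(\pi_x(\theta)/\pi(\theta))\right]\right]$    (13.29)
$\quad = E_x\left[E_{\theta|x}\left[\log(f(x|\theta)/m(x))\right]\right]$    (13.30)
$I(\mathcal{E}_2|\mathcal{E}_1) = E_{x_1}\left[E_{x_2|x_1}\left[E_{\theta|x_1,x_2}\left[\log(f(x_2|\theta, x_1)/m(x_2|x_1))\right]\right]\right]$    (13.33)
$I(\mathcal{E}_{12}) = I(\mathcal{E}_1) + I(\mathcal{E}_2|\mathcal{E}_1)$    (13.31)
$I_x(\mathcal{E}) = \int_\Theta \pi_x(\theta) \log(\pi_x(\theta)/\pi(\theta))\,d\theta$    (13.26)
$u(a, \theta, n) = u(a(\theta)) - C(n)$    (14.1)
$L(\theta, a, n) = L(\theta, a) + C(n)$    (14.2)
$r(\pi, n) = r(\pi, \delta_n^*) + C(n)$    (14.3)
$\quad = \int_{\mathcal{X}} \int_\Theta L(\theta, \delta_n^*(x^n), n)\, \pi(\theta|x^n)\,d\theta\; m(x^n)\,dx^n$    (14.4)
$\mathcal{U}_\pi(n) = \int_{\mathcal{X}} \int_\Theta u(\theta, \delta_n^*(x^n), n)\, \pi(\theta|x^n)\,d\theta\; m(x^n)\,dx^n$    (14.5)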


A.3 Probability (density) functions of some 
distributions 

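1. If $x \sim \mathrm{Bin}(n, \theta)$ then

$$f(x|\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}.$$

2. If $x \sim N(\theta, \sigma^2)$ then

$$f(x|\theta, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \theta)^2}{2\sigma^2}\right).$$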

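3. If $x \sim \mathrm{Gamma}(\alpha, \beta)$ then

$$f(x|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}.$$

4. If $x \sim \mathrm{Exp}(\theta)$ then

$$f(x|\theta) = \theta e^{-\theta x}.$$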


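5. If $x \sim \mathrm{BetaBinomial}(n, \alpha, \beta)$ then

$$m(x) = \frac{\Gamma(\alpha + \beta)\,\Gamma(n + 1)\,\Gamma(\alpha + x)\,\Gamma(\beta + n - x)}{\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(x + 1)\,\Gamma(n - x + 1)\,\Gamma(\alpha + \beta + n)}.$$

6. If $x \sim \mathrm{Beta}(\alpha, \beta)$ then

$$f(x|\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1 - x)^{\beta-1}.$$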


A.4 Conjugate updating 

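In the following description, prior hyperparameters have subscript 0, while
parameters of the posterior distributions have subscript $x$.

1. Sampling model: binomial

(a) Data: $x|\theta \sim \mathrm{Bin}(n, \theta)$.

(b) Prior: $\theta \sim \mathrm{Beta}(\alpha_0, \beta_0)$.

(c) Posterior: $\theta|x \sim \mathrm{Beta}(\alpha_x, \beta_x)$ where

$$\alpha_x = \alpha_0 + x \quad \text{and} \quad \beta_x = \beta_0 + n - x.$$

(d) Marginal: $x \sim \mathrm{BetaBinomial}(n, \alpha_0, \beta_0)$.

2. Sampling model: normal

(a) Data: $x \sim N(\theta, \sigma^2)$ ($\sigma^2$ known).

(b) Prior: $\theta \sim N(\mu_0, \tau_0^2)$.

(c) Posterior: $\theta|x \sim N(\mu_x, \tau_x^2)$ where

$$\mu_x = \frac{\sigma^2}{\sigma^2 + \tau_0^2}\,\mu_0 + \frac{\tau_0^2}{\sigma^2 + \tau_0^2}\,x \quad \text{and} \quad \tau_x^2 = \frac{\sigma^2 \tau_0^2}{\sigma^2 + \tau_0^2}.$$

(d) Marginal: $x \sim N(\mu_0, \sigma^2 + \tau_0^2)$.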
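
The two conjugate updates can be written as one-line functions. The sketch below is illustrative: the function names are hypothetical, and the normal case uses the standard result for a single observation with known variance, which is stated here as an assumption rather than taken from a formula legible in the source.

```python
def beta_binomial_update(alpha0, beta0, n, x):
    """Posterior Beta parameters after x successes in n binomial trials."""
    return alpha0 + x, beta0 + n - x

def normal_update(mu0, tau0_sq, sigma_sq, x):
    """Posterior mean and variance of theta after one draw x ~ N(theta, sigma_sq).

    Assumes the prior theta ~ N(mu0, tau0_sq) and a known sampling variance.
    """
    weight = tau0_sq / (sigma_sq + tau0_sq)          # weight given to the data
    mu_x = (1.0 - weight) * mu0 + weight * x
    tau_x_sq = sigma_sq * tau0_sq / (sigma_sq + tau0_sq)
    return mu_x, tau_x_sq

print(beta_binomial_update(1, 1, n=9, x=6))          # uniform prior, 6 successes in 9
print(normal_update(0.0, 1.0, 1.0, 2.0))             # equal prior and data variances
```

In the normal case the posterior mean is a precision-weighted average of the prior mean and the observation, so with equal variances it lands halfway between them.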